Preview chapter about the structured syntax definition of Seed7

thomas.mertes@gmx.at
Tue, 2 Dec 2008 22:59:23 -0800 (PST)

From comp.compilers

Related articles
*Preview chapter about the structured syntax definition of Seed7 thomas.mertes@gmx.at* (2008-12-02)**

| List of all articles for this month |

From:	thomas.mertes@gmx.at
Newsgroups:	comp.compilers
Date:	Tue, 2 Dec 2008 22:59:23 -0800 (PST)
Organization:	Compilers Central
Keywords:	design
Posted-Date:	04 Dec 2008 14:58:55 EST

Hello,
I am writing a chapter about the structured syntax definition
of Seed7. It would be nice to get some feedback about it before
the final release. Here is it (read it with a monospaced font):

9. STRUCTURED SYNTAX DEFINITION
===============================

        Many languages have predefined constructs like statements and
    operators. This constructs have fixed syntax and semantics. A
    natural language or some meta language is used to describe the
    syntax and semantics of the constructs. Mostly the programmers
    are not allowed to define new syntactic constructs. But sometimes
    a full description of the syntax and semantics of a construct
    written in the programming language itself is helpful. This
    avoids the need to use ambiguous natural language descriptions
    and it is also unnecessary to learn an additional meta language.
    A formal description of new constructs in the language itself
    gives also the opportunity to define new constructs. Note that a
    compiler-compiler does not offer this opportunity and has also a
    meta language.

        There are different notations to specify the syntax of
    programming languages. The syntax statements of Seed7 use a
    notation called Seed7 Structured Syntax Description (abbreviated
    with S7SSD). The Backus-Naur Form (BNF) and its variants like the
    Extended Backus-Naur Form (EBNF) are another example of such a
    syntax specification. Since it is easier to understand new
    concepts when they are compared to well known concepts, the EBNF
    will be used as base to explain the S7SSD.

9.1 The Extended Backus-Naur Form

        As the name says the Extended Backus-Naur Form is an extension
    of the BNF. The extensions allow the definition of repetitions and
    optional parts without the use of recursive definitons. The EBNF
    has the following elements:

          - Nonterminal symbols consist of lower case letters and
              underline characters.
          - Terminal symbols are quoted strings or names in upper case
              characters which describe unprintable characters (control
              characters).
          - The concatenation of nonterminal and/or terminal symbols is
              described by writing them in sequence.
          - With | two alternatives can be separated.
          - Expressions of the extended Backus-Naur form can be put under
              parenthesis.
          - When a subexpression is optional it is enclosed in squared
              brackets [ ... ] .
          - When a subexpression may be omitted or repeated it is
              enclosed in curly braces { ... } .

    The syntax of the extended Backus-Naur form can be described with
    extended Backus-Naur form.

        syntax_description ::=
            { statement } .

        statement ::=
            identifier '::=' expression '.' .

        expression ::=
            term { '|' term } .

        term ::=
            factor factor .

        factor ::=
            identifier | string | '(' expression ')' |
            '[' expression ']' | '{' expression '}' .

9.2 The syntax of a statement

        To explain the Seed7 Structured Syntax Description we propose
    that a new statement, the 'loop' statement, should be defined. A
    'loop' statement is similar to 'while' and 'repeat' loops but
    instead of having the conditional exit at the beginning or at the
    end it's conditional exit is in the middle of the loop. This
    middle conditional exit is part of the 'loop' statement. Therefore
    it should not be confused with the 'exit' statement that exists in
    some programming languages. An example of the 'loop' statement is:

        loop
            ch := getc(inFile);
        until ch = '\n' do
            stri &:= str(ch);
        end loop;

    An equivalent solution without the usage of the 'loop' statement
    would be:

        repeat
            ch := getc(inFile);
            if ch <> '\n' then
                stri &:= str(ch);
            end if;
        until ch = '\n';

    The S7SSD of the 'loop' statement is:

        $ syntax expr: .loop.().until.().do.().end.loop is -> 25;

    For now we concentrate at the heart of the S7SSD, the expression:

        .loop.().until.().do.().end.loop

    For the purpose of the syntax description we can just remove the
    dots, which gives:

            loop () until () do () end loop

    With EBNF the 'loop' statement can be described as:

        loop_statement ::=
            'loop'
                statement
            'until' expression 'do'
                statement
            'end' 'loop' .

    An EBNF description may use many nonterminal symbols such as
    'statement' or 'expression'. The S7SSD does not distinguish
    between different nonterminal symbols. Instead the S7SSD only
    knows one nonterminal symbol: ()

    Therefore the S7SSD cannot distinguish between 'statement',
    'expression' or something else. At the syntax level any kind of
    expression can by substituted for a S7SSD nonterminal symbol ().
    With EBNF it is possible to describe constraints such as the type
    of an expression. The S7SSD relies on semantic checks to verify
    such constraints. An expression like

        loop
            35
        until 1+2 do
            integer
        end loop

    would be legal given the S7SSD of the 'loop' statement:

        $ syntax expr: .loop.().until.().do.().end.loop is -> 25;

    The definition of the semantic of the 'loop' statement is:

        const proc: loop
                                    (in proc: statements1)
                                until (ref func boolean: condition) do
                                    (in proc: statements2)
                                end loop is func
            local
                var boolean: exitLoop is FALSE;
            begin
                repeat
                    statements1;
                    if not condition then
                        statements2;
                    else
                        exitLoop := TRUE;
                    end if;
                until exitLoop;
            end func;

    This semantic definition determines which types are accepted
    between the keywords. For the 'loop' example with the semantic
    errors (see above) you would get an error message like:

        *** chkloop.sd7(35):51: Match for {loop 35 until {1 + 2 } do
integer end loop } failed

9.3 Priority and assoziativity

        When a syntax construct has parameters before the first symbol
    or after the last symbol the priority and the associativity of the
    construct are significant. Constructs with stronger priority bind
    their parameters earlier than constructs with weaker priority. The
    priority is described by a natural number (inclusive 0). The
    strongest priority is 0. Weaker priorities are described by larger
    numbers. What bind means is can be declared with an example:

                                                                              =
        A + B = C * D / \
                                                                        / \
        * priority 6 + *
        + priority 7 / \ / \
        = priority 12 A B C D

    First the * takes its parameters, then the + and at last the =
    follows.

    The associativity describes, in which order constructs with equal
    priority bind their parameters. For example

        A - B - C

    can be interpreted in two ways:

        (A - B) - C or A - (B - C)

    There are four associativities possible:
                                                                                                      Symbol

        Binding from left to right ->

        Binding from right to left <-

        Neither the left nor the right parameter
        are allowed to have the same priority <->

        At the left side there is a binding from
        left to right and at the right side there
        is a binding from right to left -><-

    The last two possibilities give no legal interpretation in the
    subtraction example. The third kind of assiciativity ( <-> ) is
    used by the equal operator ( = ) of Pascal because there a
    expression like

        A = B = C

    is not legal.

    There is a second way to describe the associativity. The
    associativity describes if an operand must have a stronger
    priority than the priority of the operator. For example:

                                                          - 7
        A - B - C / \ / \
                                                    / \ <=7 / \ <7
        - priority 7 -> / \ / \
                                                - C 7 0
                                            / \ / \
                                          / \ <=7 / \ <7
                                        / \ / \
                                      A B 0 0

    The numbers in the nodes of the right tree show the priority of
    each sub expression (sub tree). With < and <= the required
    condition for the priority of an operand is described. An
    interpretation is legal if all this conditions are met. If there
    are more than one legal interpretations or no legal
    interpretation the expression is illegal.

    Table for the possibilities of associativity:

        +---------------+------------------------------+
        | associativity | The priority of the |
        +---------------+--------------+---------------+
        | | left operand | right operand |
        | | must be | must be |
        +---------------+--------------+---------------+
        | -> | <= | < |
        | <- | < | <= |
        | <-> | < | < |
        | -><- | <= | <= |
        +---------------+--------------+---------------+
        | | than that of the operator |
        +---------------+------------------------------+

    The parameter before the operator symbol is called left operand.
    The parameter after the last symbol of a construct is called
    right operand. In case of normal operators the last symbol of a
    construct and the operator symbol are identical. If this is not
    the case there is a third kind of operand. Between the operator
    symbol and the last symbol of a construct are the middle operands.
    Middle operands can have any priority.

9.4 The syntax of operators

        The S7SSD treats everything as operator description. Operators
    have priority and assoziativity. The priority and associativity
    determine in which succession the S7SSD syntax rules get applied.
    To explain priority and assoziativity we use the basic arithmetic
    operations (+,-,*,/). To describe them with EBNF you can write:

        factor :=
            number | name .

        expression_5 ::=
            factor |
            ( '+' expression_5 ) |
            ( '-' expression_5 ) .

        expression_6 ::=
            expression_5 |
            ( expression_6 '*' expression_7 ) |
            ( expression_6 '/' expression_7 ) .

        expression_7 ::=
            expression_6 |
            ( expression_7 '+' expression_6 ) |
            ( expression_7 '-' expression_6 ) .

    This describes the following things:

          - The operators have different priorities:
                - Plus and minus signs are executed first
                - Multiplication and division are executed second.
                - Addition and subtraction are executed last.
          - This priorities are exactly what you expect from an
              arithmetic expression.
          - Additionally you see that ++2 is allowed and interpreted as
              +(+(2)) which means that the plus sign is a right-associative
              operator.
          - You can also see that a*b*c is allowed and interpreted as
              (a*b)*c which means that the multiplication is a
              left-associative operator.

    All this things can also be described with S7SSD:

        $ syntax expr: . + .() is <- 5;
        $ syntax expr: . - .() is <- 5;
        $ syntax expr: .(). * .() is -> 6;
        $ syntax expr: .(). / .() is -> 6;
        $ syntax expr: .(). + .() is -> 7;
        $ syntax expr: .(). - .() is -> 7;

    As you can see the S7SSD is shorter as the description with EBNF.
    A syntax statement is expained as follows:

          - The $ is used to introduce all hardcoded statements.
          - The keyword 'syntax' introduces a structured syntax
              description.
          - The result of the recognized expression will have
              the type 'expr'. The type 'expr' is used between the syntax
              and the semantic analysis. The type 'expr' describes
              expressions which are syntactically analyzed but not
              semantically analyzed. After the semantic analysis (and
              during the runtime) the type 'expr' is not used.
          - The colon ':' is used as separator between type and syntax
              description.
          - A dot expression like '.(). * .()' is introduced (as can
              probably be guessed by the name) with a dot. For the purpose
              of the syntax description you can just remove the dots in
              your mind: '() * ()'
          - The symbol 'is' is used in all Seed7 declarations as
              separator between the name and the value.
          - The associativity is described with one of the symbols
              -> (left-associative), <- (right-associative),
              <-> (not associative) and -><- (both associativitys).
              When there are no left or right operands, as it is the case
              for the 'loop' statement, the associativity is irrelevant.
          - Finally the priority of the syntax construct is defined with
              a integer literal like '6'. The priority '6' is used for the
              operators '*', '/', 'div', 'rem', 'mdiv' and 'mod'.

    The S7SSD can also be easily used to do automatic syntax
    recognition.

    There are also other things which are out of the scope of the
    S7SSD. The syntax of tokens (whitespace, comments, identifiers and
    literals) and expressions (parentheses, function calls and dot
    expressions) is hardcoded. The hardcoded constructs are described
    in chapter 10 (Tokens) and chapter 11 (Expressions).

    For the reasons mentioned above it is not possible to transform
    every EBNF syntax description into S7SSD. Transforming S7SSD
    descriptions to EBNF is always possible.

    The advantage of the S7SSD lies in its simplicity and that a fast
    automated syntax recognition algorithm can be easily implemented.
    It is exactly the combination of hardcoded syntax recognition and
    flexible syntax rules that make it successful.

9.5 Comparison of EBNF and S7SSD

    In the S7SSD of the 'loop' statement

        $ syntax expr: .loop.().until.().do.().end.loop is -> 25;

    are no nonterminal expressions '()' before the first keyword or
    after the last keyword. Therefore the assoziativity does not play
    any role. The nonterminal expressions '()' of the 'loop' statement
    are all surrounded by keywords and therefore they can have any
    priority. As priority of the 'loop' 25 is choosen just because
    most other statements have also priority 25. The assignments
    (:= +:= *:= ...) have priority 20 and all operators used in
    arithmetic, boolean and string expressions have priorities less
    than 20. BTW: The semicolon operator (;) is defined with the
    priority 50. Operators with a priority of 0 get their parameters
    before operators with priority 1 and so on.

    The corresponding EBNF description of the 'loop' statement would
    be:

        expression_25 ::=
            'loop'
                expression_127
            'until' expression_127 'do'
                expression_127
            'end' 'loop' .

    You must keep in mind that alternative rules for expression_25 are
    also possible and that for every priority level a rule like

        expression_127 ::= expression_126 .

    is defined. Additionally the following rules are defined:

        expression_0 ::= token | parentheses_expression |
            call_expression | dot_expression .

        token ::=
            identifier | literal .

        parentheses_expression ::=
            '(' expression_127 ')' .

        call_expression ::=
            expression_127 [ '('
            [ expression_127 { ',' expression_127 } ]
            ')' ] .

        dot_expression ::=
            [ '.' ] call_expression { '.' call_expression } .

    When you want to use some special syntax which should be only
    allowed at some place you do the following:

          - Define the special syntax with S7SSD in a way that does not
              contradict with the rest of the syntax definitions.
          - Use semantic definitions to make sure that this syntax
              construct can only be used at the place desired.

    The 'elsif' and 'else' parts of the 'if' statement use this
    technic. The syntax of the 'elsif' and 'else' statement parts is:

        $ syntax expr: .elsif.().then.() is <- 60;
        $ syntax expr: .elsif.().then.().() is <- 60;
        $ syntax expr: .else.() is <- 60;

    The types 'ELSIF_RESULT' and 'ELSIF_PROC' are just defined to be
    usable for the 'elsif' and 'else' parts. A special variant of the
    'if' statement is defined with a parameter of type 'ELSIF_PROC':

        $ syntax expr: .if.().then.().().end.if is -> 25;

        const proc: if (in boolean param) then
                                    (in proc param)
                                (in ELSIF_PROC param)
                                end if is action "PRC_IF_ELSIF";

    With 'action' the primitive actions are used (instead of a high
    level definition written in Seed7). Primitive actions are
    explained here: http://seed7.sourceforge.net/manual/actions.htm
    The semantic part of the 'else' and 'elsif' part are defined with:

        const ELSIF_PROC: elsif (in boolean param) then
                                                (in proc param) is action "PRC_IF";
        const ELSIF_PROC: elsif (in boolean param) then
                                                (in proc param)
                                            (in ELSIF_PROC param) is action "PRC_IF_ELSIF";
        const ELSIF_PROC: else
                                                (in void param) is ELSIF_EMPTY;

    Since no other functions of type 'ELSIF_PROC' are defined only
    legal 'if' statements can be written.

    An explanation of the 'if' statement is here
    http://seed7.sourceforge.net/manual/stats.htm#if-statement

================================

Thanks in advance for your effort.

Greetings Thomas Mertes

Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Preview chapter about the structured syntax definition of Seed7

thomas.mertes@gmx.atTue, 2 Dec 2008 22:59:23 -0800 (PST)

thomas.mertes@gmx.at
Tue, 2 Dec 2008 22:59:23 -0800 (PST)