Re: help understand Waite & Goos

=?ISO-8859-15?Q?Dr=2E_J=FCrgen_Vollmer?= <juergen@informatik-vollmer.de>
21 Jan 2003 00:10:01 -0500

          From comp.compilers

Related articles
help understand Waite & Goos rmichalski@era.pl (Rafal Michalski) (2003-01-12)
Re: help understand Waite & Goos juergen@informatik-vollmer.de (=?ISO-8859-15?Q?Dr=2E_J=FCrgen_Vollmer?=) (2003-01-21)
| List of all articles for this month |
From: =?ISO-8859-15?Q?Dr=2E_J=FCrgen_Vollmer?= <juergen@informatik-vollmer.de>
Newsgroups: comp.compilers
Date: 21 Jan 2003 00:10:01 -0500
Organization: Compilers Central
References: 03-01-061
Keywords: parse, theory
Posted-Date: 21 Jan 2003 00:10:01 EST

Rafal Michalski wrote:


> Need some help to understand Waite and Goos "Compiler Construction"
> formalism. Obviously these may be lamer's questions of computer
> science - is there more suitable group? - but from mathematician's
> point of view, things go unclear at EBNF introduction in 5.1.4.
>
> First, it appears to me in Fig. 5.8
> as if the set T was the sum of the following sets:
> - set of identifiers
> - set of literals
> - {is, or, lpn, rpn, lbk, rbk, plus, star, period, separator}
> instead of being sum of sets
> - {identifier} (one-element set)
> - {literal} (one-element set)
> - {is, or, lpn, rpn, lbk, rbk, plus, star, period, separator}
> ?
>
> Second, what is "set of unique identifiers", how it can be disjoint
> from the set of identifiers an literals appearing in the
> specification? Please help me see both sets L, I in Fig. 5.8c.
>
> Greats
> Rafal Michalski


T is the set of the symbols "identifier", "literal", "is", etc.
But "identifier" itself is an abstractions of real identifiers like foo,
bar, sum, etc. "literal" is the abstraction of real literals like "1",
"some text", etc.


For sake of completenes you may say
      identifier ::= letter |
                                    identifier letter |
                                    identifier digit |
      digit ::= "0" | ... | "9"
      letter ::= "a" | ... | "z" | "A" | ... | "Z" | "_"


But when discussing a compiler those symbols (identifier, literal, is, ...)
are called "tokens" which are accepted by the "scanner" using a finite
state automaton (see chapter 5.2 in Goos&Waite). The scanner accepts the
bytes of the source (coded let's say in ASCII) and transforms sequences of
ASCII-characters into tokens.
The syntacical structure of the source is processed by the
so called "parser", some kind of stack automaton.


If the program source contains the following computation:
        sum := sum+ 1 ;
the token sequence is:
        identifier assign identifier plus integer_literal semicolon
Note: the blanks contained in the source are "removed" and not represented
in the token sequence.


The parser deals now only with such tokens and not with the ASCII
representations of them.


That's done for performance (finite automatons are faster than stack ones)
and for simplyfing the tasks (the scanner handles comments, whitespace,
newlines etc.)


The set of unique identifiers is now the set of all "real" identifiers as
they occur in the source program.


I hope this clarifies a little bit.


Jürgen


PS: don't hesitate to ask (me) more.


--
Dr.rer.nat. Juergen Vollmer, Viktoriastrasse 15, D-76133 Karlsruhe
Tel: +49(721) 9204871 Fax: +49(721) 24874
juergen@informatik-vollmer.de,vollmer@cocolab.de,Juergen.Vollmer@acm.org
www.informatik-vollmer.de


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.