Re: Terminals, non-terminal, syntax, semantics: some really naive questions

Chris F Clark <cfc@world.std.com>
14 Mar 2003 11:03:50 -0500

From comp.compilers

Related articles
Terminals, non-terminal, syntax, semantics: some really naive questi rmyers1400@attbi.com (Robert Myers) (2003-03-09)
*Re: Terminals, non-terminal, syntax, semantics: some really naive cfc@world.std.com (Chris F Clark)* (2003-03-14)**
Re: Terminals, non-terminal, syntax, semantics: some really naive arnold@skeeve.com (2003-03-14)
Re: Terminals, non-terminal, syntax, semantics: some really naive slk14@earthlink.net (SLK Parsers) (2003-03-14)
Re: Terminals, non-terminal, syntax, semantics: some really naive branco@canal13.com.br (2003-03-17)

| List of all articles for this month |

From:	Chris F Clark <cfc@world.std.com>
Newsgroups:	comp.compilers
Date:	14 Mar 2003 11:03:50 -0500
Organization:	The World Public Access UNIX, Brookline, MA
References:	03-03-039
Keywords:	parse
Posted-Date:	14 Mar 2003 11:03:50 EST

Yes, your quesy feeling will go away with familiarity, whether that
breeds disaster I will leave to your fate.

Part of the reason you have a quesy feeling about the term TERMINALS
is that it is a term that is "so key" that it acutally has a couple of
overlapping definitions.

To make things more clear, one should first understand the original
meaning of TERMINALS. They were the end points of the grammar, or as
noted, if the grammar is treated as a tree, they are the leaves of the
tree. The "goal symbol" (aka the start symbol) is the other end or
root of the tree.

How one draws a tree is irrelevant. The goal symbol is always the
root and the terminals are always the leaves. (This is by
definition.) This is connected to some very important facts about
parsing (and derivation). By the way, parse trees (aka derivation
trees) are conventionally drawn with the root symbol at the top of the
tree with the branches extending downward through the non-terminals
until reaching the leaves/terminals at the bottom of the tree.

The important fact about parsing is that is an attempt to match a
sequence of symbols (the terminals) to a particular structure. The
rules of the structure are codified in the productions of the grammar.
A properly formed sequences of symbols (called a sentence) can always
be matched by the rules such that a tree with root the goal symbol can
be formed with the terminals as leaves. (Here is that definition of
how the tree is formed again. A complete parse tree is defined as a
tree with the root being the goal symbol and the leaves being
terminals (and all of the branches matching rules of the grammar).
Any other tree is only a partial parse (or derivation) tree.)

Ok, queasy feeling again, what are the terminals? Guess what, they
are what the grammar writer decides they are. They are the sequence
of symbols where the grammar writer decided to stop (terminate). They
are the "atoms in the periodic table" that chemical compounds
(sentences) are formed from.

There is no magic to terminals, they are simply the simplest things
the grammar writer decided to express themselves in. Thus, in a
grammar about a natural lanugage, the terminals are typical different
classes of words: noun, verb, adjective, etc. In a grammar for a
programming lanugage, the terminals are the simplest elements of the
programming language: the keywords of the language, "identifiers"
(meaning the names the programmers has introduced), integers, floating
point numbers, etc. In each case, the grammar writer picks some
"natural" stopping point and says these are the atoms of my language.
Those atoms are the terminals.

The key fact you will find in a grammar is that terminals never have a
definition in the grammar and non-terminals always have a definition.
By the way, the goal symbol (or root) is the only non-terminal which
is not required to be referenced in a well-defined grammar (as it is
implicitly referenced as the starting point for constructing a tree).
And, by the way, if one starts at the goal symbol and selectively
choose definitions for the non-terminals until an entire tree (with
terminals as leaves) is built, the process is called "derivation".

That's the key definition of terminals. They are simply the end
points of where derivation ends (or in reverse the symbols that are
the begining points of a parse)--and parsing and derivation are simply
inverses of each other. (And by tradition, grammar theory begins with
talking about derivations and deriving or generating sentences, that
is building the tree starting with the goal symbol and terminating
with the terminal leaves.)

In a simple theoretical world, that would be the end of the story.
The confusion comes about from the practical world. We don't use
grammars to simply describe arbitrary symbols at some abstract level.
We use grammars in real-world situations where we are trying to parse
actual characters in text (or similar things). In this case, the end
points of the grammars "should" be the characters which can appear.

However, that doesn't jive well with the natural model of grammars.
We don't read simply the letters of English, we group these letters
into words. Similary, 12 is not simply two characters in a row, but
the compound is the integer number twelve.

In light of this, terminals are not the ultimate leaves of the
tree. They are simply the place where the grammar writer decide to
stop (and for pratical languages, the grammar writer must in fact,
supply definitions of the terminals also, although sometimes in a
different notation, e.g. LEX v. yacc).

So, in the practical world, the TERMINALS are simply where the
"parsing" grammar ends and the "lexing" grammar starts. In this
world, terminals are often referred to as TOKENS. This definition
overlaps with the purely theoretical one, and specifically does so,
because grammars are used to represent the parsing process.

Returning to our chemical analogy, the characters of the source text
could be considered the protons, neutrons, and electrons that comprise
atoms. Chemistry rarely deals with the sub-atomic particles as
entities themselves. The atoms are the fundamental building blocks of
molecules and all chemical compounds and the fact that the atoms can
be subdivided into consituent parts is irrelevant for their "chemical"
properties. Thus, it is the same with terminals. The grammar writer
defines a sets of terminals that have semantic validity (e.g. the
number 12 is a complete number in its own right and the constituent
characters 1 and 2, which although they determine the number, are not
independently semantically significant).

And, just as chemistry relates to sub-atomic physics, lexing relates
to parsing. There are rules at the lexing level that describe how
characters are grouped into token and how those tokens have semantic
significance. That is separate and distinct (although not always
entirely unrelated) to the use of the tokens as terminals in a parsing
grammar.

At the practial level this is further obfuscated by the fact that
different technologies are traditionally used at the lexical level
than the parsing level (regular expressions at the lexical level and
BNF grammars at the parsing level). In addition, one does not
generally do semantic analysis at the character level, one applies
semantics only to whole tokens.

This all goes to give the concept of TERMINAL a heightened sense of
inportance, although at its fundamental level it is simply still just
the leaves of the parse tree. However, because those leaves are
atomic, they take on a heightened sense of importance, since they are
the final things to which semantics are attached.

In most cases, the place to draw the division of groups of characters
into tokens (terminals) is obvious, as the language will have natural
semantic units (e.g. idntifiers and numbers). When that happens, life
is simple and the terminals are the same units that the user thinks in
terms of.

Unfortunately, sometimes the grammar has weird edges and corners that
for technical reasons cause the grammar to be divided at slightly
non-obvious places and life is not so good. At those times, where one
separates the parsing from the lexing is dictated by the technology
rather than by a natural conceptual boundary. The theoretical (or
reference) grammar may have one set of terminals representing the
conceptual units and the implementation grammar may have a different
set of tokens (or things which can be lexed by the technology chosen).

The experienced grammar writer often forgets this distinction between
the natural semantic groups the grammar should be composed from and
the non-natural groups that match the technology chosen for
implementation and casually refers to terminals in an ambiguous manner
that describes whatever happens to be convenient for the thought at
hand. That's because for the experienced writer the concept of
terminal is so key that it is no longer relevant--the same way one
might talk in a close cicle of friends about two people within that
circle that have the same name without bothering to distinguish which
they were referring to, since to insiders the reference is obvious.

By the way, that is often one mark of an experienced grammar writer,
knowing where the conceptual boundaries in the grammar lie and also
knowing where the implementation boundaries lie when they are not the
same. Often people will struggle with getting a particular grammar to
work because they try to do too much (or too little) in the lexing
phase. Knowing which problems are best suited to lexing and knowing
when something can or cannot be treated as atomic (and how to work
around the corner cases) is one of the marks of grammar writing
maturity.

Hope this makes things less vague,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Terminals, non-terminal, syntax, semantics: some really naive questions

Chris F Clark <cfc@world.std.com>14 Mar 2003 11:03:50 -0500

Chris F Clark <cfc@world.std.com>
14 Mar 2003 11:03:50 -0500