Re: A simpler way to tokenize and parse?

Kaz Kylheku <864-117-4973@kylheku.com>
Sun, 26 Mar 2023 01:17:40 -0000 (UTC)

From comp.compilers

From:	Kaz Kylheku <864-117-4973@kylheku.com>
Newsgroups:	comp.compilers
Date:	Sun, 26 Mar 2023 01:17:40 -0000 (UTC)
Organization:	A noiseless patient Spider
References:	23-03-011
Injection-Info:	gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="33532"; mail-complaints-to="abuse@iecc.com"
Keywords:	Lisp, syntax
Posted-Date:	26 Mar 2023 05:16:09 EDT

On 2023-03-24, Roger L Costello <costello@mitre.org> wrote:
> Example of tokenizing/parsing using read:
>
> (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4))

You've not quite hit upon how it works, and I'd encourage you to keep
exploring.

Read takes the seven characters (+ 3 4) and returns an object
which stands for the same thing. When Lisp programmers discuss
that object, they refer to it using the same notation (+ 3 4).

Actual copy-paste from a Lisp session:

    [1]> (read-from-string "(+ 3 4)")
    (+ 3 4) ;
    7

The second return value of read-from-string, 7, isn't the
value of the expression; it's the position of the first
character of the string which was not read. Our expression
is seven characters long.
>
> The first expression (+ 3 4) is the concrete syntax.
> The middle expression (list `+ 3 4) is an s-expression. It is an intermediate
> representation.

"S-expression" actually refers to the character syntax. The object
in memory is just an expression.

The reader in Lisps like Scheme and Common Lisp perpetrates no such
embellishment. The symbol "list" and quotation around the + will not
appear from reading "(+ 3 4)". You get a three-element list, made up out
of three cons cells (pair-like objects), whose elements are strictly
those that are implied by the read syntax: the + symbol and the two
numbers.
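
To make that concrete, here is a small Common Lisp sketch that pokes at
the object the reader returns. The function name describe-read-result
is made up for illustration; everything it calls is standard Common
Lisp. It assumes the default reader settings, with + being the ordinary
symbol from the COMMON-LISP package.

```lisp
;; Sketch: inspect the object that the reader builds from "(+ 3 4)".
;; DESCRIBE-READ-RESULT is a hypothetical helper, not a standard function.
(defun describe-read-result (text)
  "Read one form from TEXT and return a plist describing the object."
  (multiple-value-bind (form position) (read-from-string text)
    (list :consp (consp form)                  ; is it made of cons cells?
          :length (length form)                ; number of elements
          :head-is-plus (eq (first form) '+)   ; first element is the symbol +
          :rest (rest form)                    ; the remaining elements
          :position position)))                ; index of first unread char

(describe-read-result "(+ 3 4)")
```

No LIST symbol and no quote show up anywhere in the result: the first
element is the + symbol itself, followed by the integers 3 and 4.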

> The last expression (add (num 3) (num 4)) is the abstract syntax.

No such thing is user-visible in any mainstream Lisp. Lisp interpreters
directly evaluate the (+ 3 4) object.

Lisp compilers potentially build some annotated syntax tree, but
this is not a documented feature of any Lisp that I know; it will be
an internal matter.

Compiling the raw (+ 3 4) form is perfectly possible.
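
As a rough illustration, the object that comes out of the reader can be
handed straight to eval, or wrapped in a lambda expression and handed
to compile, with no intermediate tree of our own in between. The name
eval-and-compile-form is hypothetical; eval, compile and funcall are
standard. Note that some implementations compile inside eval anyway, so
the two paths are not necessarily different internally.

```lisp
;; Sketch: evaluate and compile the raw form produced by the reader.
;; EVAL-AND-COMPILE-FORM is a hypothetical helper for illustration only.
(defun eval-and-compile-form (text)
  "Read one form from TEXT; return its value via EVAL and via COMPILE."
  (let* ((form (read-from-string text))
         ;; Path 1: evaluate the list object directly.
         (evaluated (eval form))
         ;; Path 2: wrap the same object in a lambda expression,
         ;; compile that, then call the resulting function.
         (compiled-fn (compile nil `(lambda () ,form)))
         (compiled (funcall compiled-fn)))
    (values evaluated compiled)))

(eval-and-compile-form "(+ 3 4)")
```

Both values should be 7 for this input.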
>
> The book says: read is one of the great ideas of computer science. It helps
> decompose a fundamentally difficult process - generalized parsing of the input
> stream - into two simple processes:
>
> (1) reading the input stream into an intermediate representation
> (2) parsing that intermediate representation

The bigger idea in Lisp is actually "print-read consistency": that
objects have a printed notation that the machine can produce, which the
machine can read to reproduce a similar object.
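
A minimal way to check this for a given object is to print it to a
string and read the string back. The helper name round-trips-p is
invented here, and "similar" is approximated with equal, which is good
enough for lists, numbers, symbols and strings under the default
printer and reader settings.

```lisp
;; Sketch: test print-read consistency for one object.
;; ROUND-TRIPS-P is a hypothetical helper; EQUAL stands in for "similar".
(defun round-trips-p (object)
  "True if OBJECT's printed representation reads back as an EQUAL object."
  (equal object
         (read-from-string (prin1-to-string object))))

(round-trips-p '(+ 3 4))
(round-trips-p "hello")
```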

Not all objects have print-read consistency in Lisp, but things are
usually strict in the mature Lisp dialects. If something doesn't have
print-read consistency, it will print in an unreadable form that
generates an error when an attempt is made to read it back.

In Common Lisp, the character sequence #< (sharpsign less-than),
in the standard readtable, signals an error when read. Objects which
don't have a printed notation that can be read can use that
syntax, e.g. #<socket-handle 10.1.2.3:8080>.
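
Here is a sketch of both halves of that in Common Lisp. The class
socket-handle, its address slot and the helper readable-text-p are all
made up for this example; print-unreadable-object, print-object and the
reader-error condition type are standard. It assumes *print-readably*
is left at its default of NIL.

```lisp
;; Sketch: an object that prints in #<...> notation, and a check that
;; the reader rejects that notation. SOCKET-HANDLE and READABLE-TEXT-P
;; are hypothetical names used only for illustration.
(defclass socket-handle ()
  ((address :initarg :address :reader socket-handle-address)))

(defmethod print-object ((handle socket-handle) stream)
  ;; PRINT-UNREADABLE-OBJECT emits the surrounding #< and >.
  (print-unreadable-object (handle stream :type t)
    (write-string (socket-handle-address handle) stream)))

(defun readable-text-p (text)
  "Return T if TEXT reads as a Lisp object, NIL on a reader error."
  (handler-case (progn (read-from-string text) t)
    (reader-error () nil)))

(let ((text (prin1-to-string
             (make-instance 'socket-handle :address "10.1.2.3:8080"))))
  (list text (readable-text-p text)))
```

The printed text starts with #<, and feeding it back to the reader is
reported as unreadable, whereas something like "(+ 3 4)" reads fine.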

> I've read several compiler books and none of them talked about this. They talk
> about creating a lexer to generate a stream of tokens and a parser that
> receives the tokens and arranges them into a tree data structure. Why no
> mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of
> the great ideas of computer science"?

It's because we are not in a branch of the parallel universe in which a
lot of people know about and program in Lisp.

The Lisp microcosm has a lot to say on many topics, but computing
is largely ignorant of it.

> I have done some work with Flex and Bison and recently I've done some work
> with building parsers using read. My experience is the latter is much easier.
> Why isn't read more widely discussed and used in the compiler community?
> Surely the concept that read embodies is not specific to Lisp and Scheme,
> right?

S-expressions do crop up outside of Lisp.

The IMAP4 protocol uses them.

The GNU C compiler uses a form of S-expression internally.
Look up RTL:

https://gcc.gnu.org/onlinedocs/gccint/RTL.html#RTL

The Rational Rose object design tool stores files in an S-expression
format called Petal.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca
