Re: Regular expressions in lexing and parsing

Kaz Kylheku <847-115-0292@kylheku.com>
Tue, 18 Jun 2019 20:01:50 +0000 (UTC)

          From comp.compilers

Related articles
Regular expressions in lexing and parsing ed_davis2@yahoo.com.dmarc.email (Ed Davis) (2019-05-17)
Regular expressions in lexing and parsing jamin.hanson@googlemail.com (Ben Hanson) (2019-05-18)
Re: Regular expressions in lexing and parsing DrDiettrich1@netscape.net (Hans-Peter Diettrich) (2019-05-21)
Re: Regular expressions in lexing and parsing drikosev@gmail.com (Ev. Drikos) (2019-05-23)
Re: Regular expressions in lexing and parsing christopher.f.clark@compiler-resources.com (Christopher F Clark) (2019-06-17)
Re: Regular expressions in lexing and parsing quinn.jackson@ieee.org (Quinn Jackson) (2019-06-18)
Re: Regular expressions in lexing and parsing quinn.jackson@ieee.org (Quinn Jackson) (2019-06-18)
Re: Regular expressions in lexing and parsing 847-115-0292@kylheku.com (Kaz Kylheku) (2019-06-18)
Re: Regular expressions in lexing and parsing christopher.f.clark@compiler-resources.com (Christopher F Clark) (2019-06-18)

From: Kaz Kylheku <847-115-0292@kylheku.com>
Newsgroups: comp.compilers
Date: Tue, 18 Jun 2019 20:01:50 +0000 (UTC)
Organization: Aioe.org NNTP Server
References: 19-06-005 19-06-008
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="20066"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, parse, comment
Posted-Date: 18 Jun 2019 16:30:24 EDT

On 2019-06-18, Quinn Jackson <quinn.jackson@ieee.org> wrote:
> On Tue, Jun 18, 2019 at 10:09 AM Christopher F Clark
><christopher.f.clark@compiler-resources.com> wrote:
>>
>> Hans-Peter Diettrich <DrDiettrich1@netscape.net> wrote:
>>
>> > I think that regular expressions are primarily usable in *formal* token
>> > grammars. In contrast to parsers I'm missing *complete* formal grammars
>> > for the tokens, including whitespace, comments, strings and other
>> > special language elements, and how these fit together.
>
> To which Christopher F Clark
><christopher.f.clark@compiler-resources.com> responded in part:
>
>> I definitely would not discount the use of formal grammars both for
>> lexing (i.e. formal regexes) and parsing and the use of tools.
>> When we did Yacc++, we specifically made our lexer syntax look like
>> our parser syntax (and used LR yacc-style productions extended with
>> regular expressions on the right hand side) to make the notation
>> formal, regular, and simple. I would call that one of our successes.
>
> When it comes to adaptive grammars, since the terminal and
> non-terminal sets can have items added and removed dynamically during
> a parse, it makes a lot of sense to treat lexing and parsing as one
> and the same thing. I would add that even AST traversal and decoration
> can be included in this family, in practice. Treating them
> orthogonally is, in my experience, less useful overall. It also makes
> it much simpler in practice to read such grammars at their level of
> intent.


Separation of parsing into a token level (delimited by a regular
grammar) and a phrase-structure level (a recursive grammar) has the
benefit that each level needs only a single symbol of lookahead.
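
As a minimal sketch of what that means at the phrase level (Python,
with a made-up toy grammar; the names are purely illustrative, and the
parser never looks past the current token):

    # Token stream is assumed to come from a separate lexer as a list
    # of (kind, text) pairs.
    # Grammar: expr := term (('+' | '-') term)*
    #          term := NUM | '(' expr ')'

    def parse_expr(toks, i):
        node, i = parse_term(toks, i)
        while i < len(toks) and toks[i][0] in ('+', '-'):  # one-token peek
            op = toks[i][0]
            rhs, i = parse_term(toks, i + 1)
            node = (op, node, rhs)
        return node, i

    def parse_term(toks, i):
        kind, text = toks[i]                               # one-token peek
        if kind == 'num':
            return ('num', int(text)), i + 1
        if kind == '(':
            node, i = parse_expr(toks, i + 1)
            assert toks[i][0] == ')'
            return node, i + 1
        raise SyntaxError('unexpected token: %r' % (toks[i],))

    # parse_expr([('num', '1'), ('+', '+'), ('num', '2')], 0)
    #   -> (('+', ('num', 1), ('num', 2)), 3)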


If everything is lumped into one (let's call it "holistic parsing"),
long lookahead and/or backtracking is required.


For instance, if we are parsing C that way, after we have seen the
character "d", we don't know whether we are looking at an identifier
or the start of the "default" keyword, and we still don't know that
after we have seen "de", "def", ..., "defa". Only when we get to
"default", and then peek at the next character can we decide.


Even if we don't care about the lookahead technicalities, tokenizing a
language before analyzing its syntax is justifiable as a good
abstraction.


For instance, Lisp tokenizes everything into a nested data structure,
and the actual grammar of phrases is then analyzed over that data
structure. By the time the grammar is analyzed, symbols are objects
with an identity; that identity, rather than the spelling of their
names, is the basis for their equivalence or inequality.
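
Interning is the mechanism; here is a minimal Python stand-in for what
the Lisp reader does (the names are made up for illustration):

    class Symbol(object):
        def __init__(self, name):
            self.name = name

    _symtab = {}

    def intern(name):
        if name not in _symtab:
            _symtab[name] = Symbol(name)
        return _symtab[name]       # same spelling -> same object

    a = intern('defun')
    b = intern('defun')
    assert a is b                  # identity comparison, not string comparison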


Another way to justify tokenization is that the entities delimited by
tokenization correspond quite precisely to entities that appear in the
final abstract syntax tree (at least, those of them which do appear).
What I mean is that even if we treat "how now
brown cow" with a holistic parser, if the four chunks "how", "now",
"brown" and "cow" are then recognizable in the abstract syntax tree, we
might as well have tokenized those items in the first place, because we
have effectively done it anyway. Our language has a concept of these
tokens.


Lastly, let's look at how natural human languages have layer divisions:
phonemes, morphemes and so on. We can analyze language phonetically
without caring about higher level syntax. It would be awkward to have to
give the syntax of a sentence in terms of phonemes, rather than entities
like "VP" (verb phrase) or "COMP" (complementizer); such a view would
likely impede our understanding more than promote it.


--
TXR Programming Language: http://nongnu.org/txr
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1
[That default example doesn't seem like lookahead to me. While you're
scanning the characters, you're in a merged state that might end up as
a keyword or an identifier, just like in any DFA lexer. There may
well be other places where things are messier. -John]

