Re: Please provide a learning path for mastering lexical analysis languages

George Neuner <gneuner2@comcast.net>
Sun, 08 May 2022 12:52:14 -0400

          From comp.compilers

Related articles
| List of all articles for this month |
From: George Neuner <gneuner2@comcast.net>
Newsgroups: comp.compilers
Date: Sun, 08 May 2022 12:52:14 -0400
Organization: A noiseless patient Spider
References: 22-05-010
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="39045"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 08 May 2022 14:45:37 EDT

On Fri, 6 May 2022 11:59:17 +0000, Roger L Costello
<costello@mitre.org> wrote:


>I want to master lexical analysis.
>
>In Chris Christopher's last post he said:
>
>> [Flex] is not even as powerful as some other lexical analysis languages
>> and even exploiting its power often requires things that cannot be
>> expressed in Flex alone nor can they be done in ways that are simple
>> and easy to reason about.
>
>I am currently in the process of mastering Flex. I feel that mastering Flex
>will give me a foundation for learning other more advanced lexical analysis
>languages. "This XYZ lexical analysis language does it this way, Flex does it
>this other way. Ah, yes, I can see how XYZ's way is simpler and more
>powerful."
>
>From Chris's post I see that there is much to learn beyond Flex. Thank you
>Chris.
>
>Can you provide a learning path, please? A learning path for mastering lexical
>analysis languages.


It's not the grammars (languages) used by the tools that you need to
learn - it's the methods used the tools that are important.




>After I master Flex, what lexical analysis language should I then master? And
>after mastering that, what is the next lexical analysis language that I should
>master?
>
>/Roger


Lexical analysis is not terribly interesting ... it really just is
about tokenizing the input. There are 2 main techniques: regular
expressions, and recursive descent code. In practice, these may be
combined.


Flex essentially is regex plus a pushdown stack. The stack offers
some limited ability to recognize simple 'inclusion' languages that
regex can't handle alone, but in the Chomsky sense it does not add
very much.


You can look at ANTLR for an example of RD lexing. There are others.


The main difference between regex and RD is that regex is an unguided
method which simply changes state in the match automaton in response
to incoming characters. In contrast, the incoming character in RD
guides the path taken through the matching code: e.g.,


    if ('h' == ch)
        if ('e' == ch)
            if ('l' == ch)
                if ('p' == ch)
                    { // recognized 'help' }
                else
                if ('l' == ch)
                    :


In practice real RD code often matches whole words or word prefixes
rather than individual characters as shown here, but the principle is
the same. And sometimes regex are used to perform the word matches.






Grammatical syntax analysis - i.e. parsing - is where things get
interesting.


There are 3 typical methods of parsing: LL, LR, and PEG.


LL and PEG typically are associated with RD code. LR typically is
associated with table driven pushdown automata.
[I do recall seeing a paper in which the LR automata was /implemented/
using RD code in CPS, but naturally I can't find that paper just now.
It really just proves that there are more ways to implement FA than
many people realize 8-). I have not similarly seen an LL parser
implemented with table or graph FA/PDA, but I suspect that could be
done as well. Chris Clark probably knows better.]


You also may see references also to 'combinator' parsing, which some
people consider to be a unique method, but which more typically is
associated with PEG. IMO it is simply an implemenation technique that
is equally applicable to LL, but MMV on this point because - although
their implementation looks similar - there are some not-insignificant
technical differences between LL and PEG.




Bison is an LR based tool associated strongly with Flex. Bison offers
a few different variants of LR (LALR,SLR,GLR) that offer different
limitations on the recognized language and (where they overlap)
differences in the sizes of the generated parsers. Parser size was
much more important in the past - nowadays it's only relevant to
particular programming niches such as embedded.


ANTLR is an LL based tool that generates RD code for both parsers and
lexers. Since v3, ANTLR is LL(*) - prior to v3, ANTLR was LL(k). Both
LL(k) and LL(*) are interesting in their own right and you may wish to
look at both.
[Many LL tools only are LL(1). LL(k) and LL(*) both deal with
management of lookahead. LL(k) was created to deal with limitations
of LL(1), and LL(*) relieves the programmer of having to determine the
proper 'k' for the grammar.]


The biggest difference is in how you deal with recursive structure in
the language: in LL (or PEG) you need to right-factor your grammar, in
LR you need to left-factor instead. The factoring determines the
order in which recursive structures are recognized.




I personally have little experience with existing PEG tools ...
hopefully someone can give you some suggestions.


Combinator parsers usually are hand written using a library of
matching (or match creating) functions. Most often combinator parsers
are seen in [higher order] functional languages where the functions
easily can be strung together, but there are at least a few libraries
available for C and C++.


Hope this helps.
George


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.