Re: Languages with optional spaces

Kaz Kylheku <493-878-3164@kylheku.com>
Wed, 26 Feb 2020 08:06:04 +0000 (UTC)

From comp.compilers


From:	Kaz Kylheku <493-878-3164@kylheku.com>
Newsgroups:	comp.compilers
Date:	Wed, 26 Feb 2020 08:06:04 +0000 (UTC)
Organization:	Aioe.org NNTP Server
References:	20-02-015
Injection-Info:	gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="4321"; mail-complaints-to="abuse@iecc.com"
Keywords:	lex, Basic, history
Posted-Date:	27 Feb 2020 17:33:44 EST

On 2020-02-19, Maury Markowitz <maury.markowitz@gmail.com> wrote:
> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
> of 101 BASIC Games (DEF FN is the last feature to add).
>
> Then I got to Super Star Trek. To save memory, SST removes most spaces, so
> lines look like this:
>
> 100FORI=1TO10
>
> Here's my current patterns that match bits of this line:
>
> FOR { return FOR; }
>
> [:,;()\^=+\-*/\<\>] { return yytext[0]; }
>
> [0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
> yylval.d = atof(yytext);
> return NUMBER;
> }
>
> "FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
> yylval.s = g_string_new(yytext);
> return IDENTIFIER;
> }
>
> These correctly pick out some parts, numbers and = for instance, so it sees:
>
> 100 FORI = 1 TO 10
>
> The problem is that FORI part. Some BASICs allow variable names with more than
> two characters, so in theory, FORI could be a variable. These BASICs outlaw
> that in their parsers; any string that starts with a keyword exits then, so
> this would always parse as FOR. In lex, FORI is longer than FOR, so it returns
> a variable token called FORI.
>
> Is there a way to represent this in lex? Over on Stack Overflow the only
> suggestion seemed to be to use trailing syntax on the keywords, but that
> appears to require modifying every one of simple patterns for keywords with
> some extra (and ugly) syntax. Likewise, one might modify the variable name
> pattern, but I'm not sure how one says "everything that doesn't start with one
> of these other 110 patterns".

Two ideas:

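1. Just forget about recognizing whole variable names in the lexer.
Instead, have the lexer recognize only the constituent letters of a
variable name, each as its own token. Because a keyword such as FOR is
longer than one letter, lex's longest-match rule prefers the keyword.
Then, in the parser, have a grammar production which assembles
consecutive letters into a variable.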

      variable : VARCHAR
               | variable VARCHAR
               ;

2. Use regex patterns in the lexer to recognize just the keywords,
as above. Then handle variable names with a rule that matches
just one letter A-Z, whose lex action performs ad-hoc
lexical analysis using C logic. At that point you know that you do not
have a keyword, because no keyword rule matched. You can read further
characters using input() (called yyinput() in a C++ scanner) and
accumulate a variable name.
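
To make idea (1) concrete, here is a minimal sketch in plain C rather
than lex/yacc, so that it can be compiled on its own. The keyword table
and the names keyword_at and split_line are made up for illustration.
The folding of adjacent letters into one name stands in for what the
"variable" production would do in the parser. Digits inside variable
names and type suffixes such as $ are left out to keep it short.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword table; a real BASIC has about a hundred entries. */
static const char *const keywords[] = { "FOR", "TO", "NEXT", "PRINT" };

/* Length of the keyword that starts at s, or 0 if none does. */
static size_t keyword_at(const char *s)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        size_t n = strlen(keywords[i]);
        if (strncmp(s, keywords[i], n) == 0)
            return n;
    }
    return 0;
}

/*
 * Split src into space-separated lexemes written to out. Keywords come
 * out whole. Any other letter is a one-character VARCHAR, and a run of
 * adjacent VARCHARs is folded into one name, which is the job the
 * "variable : VARCHAR | variable VARCHAR" production does in a parser.
 */
static void split_line(const char *src, char *out, size_t outsize)
{
    size_t o = 0;
    int in_name = 0;    /* was the previous token a VARCHAR? */

    /* Worst case: every character becomes its own lexeme. */
    assert(outsize >= 2 * strlen(src) + 1);

    while (*src != '\0') {
        size_t n = keyword_at(src);

        if (n == 0 && isalpha((unsigned char)*src)) {
            if (!in_name && o > 0)
                out[o++] = ' ';
            out[o++] = *src++;
            in_name = 1;
            continue;
        }
        in_name = 0;
        if (*src == ' ') {          /* spaces are optional, so skip them */
            src++;
            continue;
        }
        if (n == 0) {               /* a run of digits, or one other char */
            n = 1;
            if (isdigit((unsigned char)*src))
                while (isdigit((unsigned char)src[n]))
                    n++;
        }
        if (o > 0)
            out[o++] = ' ';
        memcpy(out + o, src, n);
        o += n;
        src += n;
    }
    out[o] = '\0';
}
```

Calling split_line on "100FORI=1TO10" with a large enough buffer
separates FOR from I, because the keyword check runs before the
single-letter case. Note that, as in the BASICs under discussion, a
name such as ATOB can never be a variable here: it comes out as A, TO, B.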

A variant of technique (2) is used for scanning C comments,
as an alternative to an ugly regular expression:

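    "/*"    {
                int c;

                /* input() is spelled yyinput() in a C++ scanner. At end
                   of input it returns 0 or EOF, depending on the flex
                   version, so check for both. */
                while ((c = input()) != 0 && c != EOF)
                {
                    if (c == '\n') {
                        /* increment line number or something */
                    }
                    else if (c == '*')
                    {
                        if ((c = input()) == '/')
                            break;          /* found the closing star-slash */
                        else if (c == 0 || c == EOF)
                            break;          /* unterminated comment */
                        else
                            unput(c);       /* re-scan: may be '*' or '\n' */
                    }
                }
            }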

The above is an adaptation of something from an old Flex manual.
IIRC the Dragon Book has a similar example of ad-hoc logic
in a lex rule for handling C comments.

You can see that it's a similar idea. We use a regex to partially match
the comment, just the /* opening. Then we take over from there.

I have a hunch this would work for fetching variables like FORI, when
there is no match on a keyword like FOR.
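
As a rough sketch of what that ad-hoc logic might look like, here it
is as a standalone C function rather than a lex action, so that it
compiles by itself. The keyword table and the names keyword_starts and
scan_name are hypothetical. In a real lex action the characters would
come from input(), with lookahead pushed back via unput(), instead of
from a string pointer. Stopping where a keyword begins is my assumption
about the desired behavior, based on the BASICs described in the
question. Type suffixes such as $ are omitted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword table, standing in for the lexer's keyword rules. */
static const char *const keywords[] = { "FOR", "TO", "NEXT", "STEP" };

/* Does some keyword start at s? */
static int keyword_starts(const char *s)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strncmp(s, keywords[i], strlen(keywords[i])) == 0)
            return 1;
    return 0;
}

/*
 * Called with s pointing at a letter that does not begin a keyword,
 * which is the state a one-letter catch-all rule would be in. Copies
 * the variable name into name (at most size - 1 characters) and returns
 * how many characters were consumed. Accumulation stops at the first
 * character that cannot be part of a name, or where a keyword begins.
 */
static size_t scan_name(const char *s, char *name, size_t size)
{
    size_t n = 0;

    assert(size >= 2);
    assert(isalpha((unsigned char)s[0]) && !keyword_starts(s));

    do {
        name[n] = s[n];
        n++;
    } while (n < size - 1
             && isalnum((unsigned char)s[n])
             && !keyword_starts(s + n));

    name[n] = '\0';
    return n;
}
```

For the line in question, the FOR rule consumes the keyword first and
scan_name then sees "I=1TO10", from which it takes just the I. Given
"ATOB" it stops after A, because TO begins at the next position.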

--
TXR Programming Language: http://nongnu.org/txr
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1
