Re: re2c-1.0 released!

Kaz Kylheku <>
Sun, 3 Sep 2017 15:14:29 +0000 (UTC)

          From comp.compilers

From: Kaz Kylheku <>
Newsgroups: comp.compilers
Date: Sun, 3 Sep 2017 15:14:29 +0000 (UTC)
Organization: NNTP Server
References: 17-08-007 17-09-001 17-09-003
Keywords: lex, design
Posted-Date: 03 Sep 2017 12:04:34 EDT

On 2017-09-02, Anton Ertl <> wrote:
> Kaz Kylheku <> writes:
>>Briefly, why would you do some hacky regex thing in lex with \1, \2,
>>\3, when in the level immediately above yylex() you have proper phrase
>>recognition, with $1, $2, $3.
> These don't do the same thing. Taking your example from the other
> posting, do you want to recognize
> 2017
> -08
> /* bla bla */ -
> 28
> as date? If so, do it at the parser level, if not, the scanner.

We face this choice only because we put a hack into the lexer:
when it scans a whitespace or comment token, it throws the token away,
so the parser never sees it.

That hack is very easy to implement and considerably simplifies the
grammar.

We can very easily implement another hack that lets us recognize
2017-08-28 as three tokens, but only when no comments or whitespace
come between them.

In fact we can do it all in standard lex.

We can set up a custom YYINPUT which allows the entire yytext to be
pushed back into the stream. The rest is done with orchestration of one
or more start states.
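
To make the push-back part concrete, here is a minimal sketch in C of an input source that serves pushed-back characters before touching the real stream. It is only an illustration, not what any particular lex does internally: the names pb_buf, unput_char and next_char and the capacity PB_MAX are assumptions, and the body of unput_string is a guess at what such a helper could look like.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PB_MAX 256               /* assumed capacity of the push-back stack */

static char pb_buf[PB_MAX];      /* pushed-back characters, last in, first out */
static size_t pb_top = 0;

/* Push one character back; it becomes the next character read. */
static void unput_char(int c)
{
    assert(pb_top < PB_MAX);
    pb_buf[pb_top++] = (char)c;
}

/* Push a whole string back so that it is re-read left to right.
   The characters go onto the stack in reverse order. */
static void unput_string(const char *s)
{
    size_t n = strlen(s);
    while (n > 0)
        unput_char(s[--n]);
}

/* What a custom input hook would call: pushed-back text first,
   then the underlying stream. */
static int next_char(FILE *in)
{
    if (pb_top > 0)
        return (unsigned char)pb_buf[--pb_top];
    return getc(in);
}
```

Pushing an end marker first and the matched text second means the text is read back before the marker, which is the order the date hack relies on.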

The steps to recognizing a date would then be:

1. First NNNN-NN-NN is scanned as a token.
2. unput_string(yytext) is called to push the whole match back
3. BEGIN(DATE) is invoked
4. The rule doesn't return, so the lexer re-scans the pushed-back input
      in the DATE state.
5. In the DATE state, an integer token is recognized; everything
      else is an error. Dash tokens can be recognized and returned,
      or consumed in the lexer.
6. Successful recognition of an integer token in the DATE state
      returns not an INTEGER to the parser but a DINTEGER.
7. The parser's phrase structure for matching dates refers to DINTEGER
      terminals and not INTEGER.
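
The lexer-side steps can be imitated in a small hand-written scanner, given here as a rough sketch in C rather than lex. Two simplifications are swapped in and should not be mistaken for the scheme itself: "pushing back" is modelled by simply not advancing the read pointer, and the end of the date is tracked with a pointer instead of an end-signal character. The token codes, struct scanner, match_date and next_token are all names invented for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token codes and state names are assumptions made for this sketch. */
enum { T_EOF = 0, INTEGER = 258, DINTEGER = 259, T_ERROR = 260 };
enum state { INITIAL, DATE };

struct scanner {
    const char *p;          /* read position */
    const char *date_end;   /* where the current DATE region ends */
    enum state st;
    int value;              /* value of the last INTEGER or DINTEGER */
};

/* Skip a run of digits at q; return NULL if there is none. */
static const char *digits(const char *q)
{
    const char *r = q;
    while (isdigit((unsigned char)*r))
        r++;
    return r > q ? r : NULL;
}

/* Return the end of a [0-9]+-[0-9]+-[0-9]+ match at q, or NULL. */
static const char *match_date(const char *q)
{
    for (int i = 0; i < 3; i++) {
        if (i > 0) {
            if (*q != '-')
                return NULL;
            q++;
        }
        q = digits(q);
        if (q == NULL)
            return NULL;
    }
    return q;
}

static int next_token(struct scanner *s)
{
    assert(s != NULL && s->p != NULL);
    for (;;) {
        if (s->st == DATE && s->p == s->date_end)
            s->st = INITIAL;                  /* date fully re-scanned */

        if (s->st == INITIAL) {
            const char *end;
            while (*s->p == ' ' || *s->p == '\t' || *s->p == '\n')
                s->p++;                       /* whitespace is thrown away */
            if (*s->p == '\0')
                return T_EOF;
            end = match_date(s->p);
            if (end != NULL) {                /* step 1: whole date matched */
                s->date_end = end;            /* step 2: p stays put ("push back") */
                s->st = DATE;                 /* step 3 */
                continue;                     /* step 4: no return, re-scan */
            }
        }
        if (isdigit((unsigned char)*s->p)) {  /* steps 5 and 6 */
            s->value = 0;
            while (isdigit((unsigned char)*s->p))
                s->value = s->value * 10 + (*s->p++ - '0');
            return s->st == DATE ? DINTEGER : INTEGER;
        }
        if (*s->p == '-')
            return *s->p++;                   /* dash returned as itself */
        s->p++;
        return T_ERROR;                       /* anything else is an error */
    }
}
```

Scanning "2017-08-28" with this yields DINTEGER '-' DINTEGER '-' DINTEGER, while "2017 -08" yields INTEGER '-' INTEGER, so the parser can tell a contiguous date from an arithmetic expression by token type alone.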

Pseudo-code, including a mechanism for returning to the INITIAL state
without parser involvement:



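<INITIAL,DATE>[0-9]+ {
    yylval.value = str_to_int(yytext); /* our function */
    /* step 6: in the DATE state the parser gets DINTEGER */
    return (YYSTATE == DATE) ? DINTEGER : INTEGER;
}

<DATE>- {
    return '-';
}

[0-9]+-[0-9]+-[0-9]+ {
    /* DATE must be an exclusive state, or this rule would match again */
    unput('!'); /* end signal */
    unput_string(yytext); /* our function; works with our YYINPUT */
    BEGIN(DATE); /* no return: re-scan the pushed-back text */
}

<DATE>! { /* our end signal: include other states here */
    BEGIN(INITIAL);
}

<DATE>. {
    /* internal error: how did we get here? */
}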