Re: Lex/Yacc: Reading fixed field lengths [LONG!] (Axel Belinfante)
6 Jan 1999 04:01:49 -0500

          From comp.compilers

Lex/Yacc: Reading fixed field lengths (Ted Christiansen) (1999-01-03)
Re: Lex/Yacc: Reading fixed field lengths [LONG!] (1999-01-06)
From: (Axel Belinfante)
Newsgroups: comp.compilers
Date: 6 Jan 1999 04:01:49 -0500
Organization: Univ. of Twente, Dept. of Computer Science, Tele Informatics group
References: 99-01-015
Keywords: lex, translator

Ted Christiansen writes:

|> How can I setup lex/yacc to parse lines with fixed field lengths? In
|> the example below, the fields are 8 characters wide. Sometimes there is
|> white space between, sometimes not - this would cause a typically
|> configured scanner to fail. Any help would be appreciated.
|> -------1-------2-------3-------4-------5-------6-------7
|> GRID           1       0    124.-346.999 2.99999       0
|> GRID           3       0 49.9999-392.999-4.99999       0

As a coincidence, I'm working on a small 'translator' that will (it
seems) allow exactly what you want. I already have some ideas on my
translator language, and on the mapping to lex and yacc input. [The
translator may end up as a part of our term-processor kimwitu, or as
a separate tool - I still have to decide.] [If a tool capable of what
I describe below already exists, pointers are very welcome - I have
not yet started coding 'my' translator :-) ]

To give a short answer: use lex rules as suggested by our moderator.
If you need more flexibility, omit the 'repeat count' constant from
the lex regexp, so lex reads one 'item' at a time, and have the lex
action check whether a repeat count (set by the caller of yylex,
through a function supplied as part of the lex input) has been
reached. If not, don't return from the scanner yet: just call
yymore() to remember what was scanned so far and 'fall out of the
action', so lex continues to scan. Only return from the scanner once
the repeat count has been reached.

And now for the long answer (which includes sample code). For my
application I (hope I) don't need the full power of yacc and lex, so
I can use a more concise language for the translator. The idea is
that I can specify patterns that consist of tokens, where each token
can have a certain (exact, (minimal?) or maximal) length (or
repeat-count) that is given with the token as an expression (instead
of the constant allowed in lex regexps). With each token I can give a
variable that will point to the string for the token as soon as the
token has been read completely. This allows a length-expression to
refer to those tokens of the pattern that have been read before. [In
addition, my little language will contain non-terminals, and each
pattern corresponds to a rule for one non-terminal.]

For each pattern I will generate a yacc rule, and for each token a lex
rule. In the yacc rule I will call some scanner routines to inform the
scanner of the (maximum) length of the next token to be scanned (taken
from the length-expression in the pattern), and to put it in the right
scanner state (to recognise the next token). In the lex rule I will
keep track of the length of the token that is being scanned, and
compare that with the length wanted: if we have not yet reached the
length we need, I will only call yymore to 'remember' the prefix of
the token scanned so far and fall out of the lex action _without_
returning to the parser, which will cause lex to continue reading;
otherwise, I return the token to the parser. If I have read too much
(suppose we have two fields f1 and f2 that _together_ are 8
characters long - if f1 takes all 8 characters, f2 should have length
0), I will 'unput' the one character too many and return the token to
the parser.
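To make the counting concrete, here is a minimal sketch in plain C,
outside lex, of reading a token of a requested length one character at
a time. The struct field_scanner and the function next_field() are
hypothetical names used only for illustration; scancount, timesread
and timestoread merely echo the names used in this post. It is not the
generated scanner, just the same idea under those assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the scanner state (illustration only). */
struct field_scanner {
    const char *input;   /* remaining input of the current line */
    size_t timestoread;  /* characters wanted for the next token */
};

/* What the parser would call to request the length of the next token. */
static void scancount(struct field_scanner *s, size_t n)
{
    s->timestoread = n;
}

/*
 * Copy the next token into out (capacity cap), one character at a time.
 * Returns the token length, or -1 if the line ends before the requested
 * number of characters has been read.
 */
static int next_field(struct field_scanner *s, char *out, size_t cap)
{
    size_t timesread = 0;

    assert(cap > s->timestoread);       /* room for the token plus '\0' */
    while (timesread < s->timestoread) {
        char c = *s->input;
        if (c == '\0' || c == '\n') {   /* field is incomplete */
            out[timesread] = '\0';
            return -1;
        }
        out[timesread++] = c;           /* keep the prefix, like yymore() */
        s->input++;
    }
    out[timesread] = '\0';
    s->timestoread = 1;                 /* back to the default count */
    return (int)timesread;
}
```

Because this sketch checks the count before it consumes a character, a
zero-length field needs no push-back. A lex rule such as `.` always
matches at least one character, which is why the lex action needs the
'unput' case.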

[For my application I want the scanner to not even try to read one
  character more than I expect; that is why I have the scanner read its
  input one character at a time, and why the scanner exports a routine
  scannedlast() that the parser can call to tell it to stop reading.]
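As an illustration of reading one character at a time, the following C
sketch shows a helper that a flex YY_INPUT definition could delegate
to. The helper name read_one_char() is hypothetical, and treating the
'ready' flag as "report end of input" is an assumption about how the
flag is meant to work, not something this post spells out.

```c
#include <assert.h>
#include <stdio.h>

static int ready = 0;   /* set to 1 once the parser wants us to stop */

/*
 * Hypothetical helper: store at most one character from 'in' in buf.
 * Returns the number of characters stored; 0 signals end of input
 * (the value flex calls YY_NULL).
 */
static int read_one_char(char *buf, int max_size, FILE *in)
{
    int c;

    assert(max_size >= 1);
    if (ready)          /* assumption: 'ready' means stop reading */
        return 0;
    c = getc(in);
    if (c == EOF)
        return 0;
    buf[0] = (char)c;
    return 1;
}

/*
 * In the definitions section of the flex input this might be used as:
 *
 *   #define YY_INPUT(buf, result, max_size) \
 *       { result = read_one_char(buf, max_size, yyin); }
 */
```

Reading a single character per call is slow compared with flex's
default block reads, but it guarantees the scanner never consumes
input beyond the token it was asked for.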

For your example the input would be something like:
(instead of the regexp . for each field, we can also use the one
  proposed by our moderator: [-0-9. ] and adapt the lex rule accordingly)

< /* pattern */
    grid = (GRID) /* length-expression appears between { and } */
    f1 = (.{8 - strlen(grid)}) /* here we can refer to grid */
    f2 = (.{8})
    f3 = (.{8})
    f4 = (.{8})
    f5 = (.{8})
    f6 = (.{8})
    f7 = (.{8})
    nl = (\n)
> -> { /* here I can put an action in which I can refer to the
         variables introduced above */ }

This will be mapped on lex (actually: flex) rules like the following:

%{
static int ready = 0;       /* used to sync with YY_INPUT */
static int timesread = 0;   /* number of times (chars) we have read */
static int timestoread = 1; /* number of times (chars) we should read */

/* definition of YY_INPUT that reads one character at a time omitted */
%}

/* _exclusive_ start states */
%x S_PATTERN
%x S_dot
%x S_nl

%%

<S_PATTERN>GRID {
    yylval.yt_charstring = strdup(yytext);
    return T_GRID;
}

<S_dot>. {
    timesread++; /* count this character (increment assumed here) */
    if (timesread < timestoread) {
        /* not yet read enough: keep the prefix and continue scanning */
        yymore();
    } else if (timesread > timestoread) {
        /* we have read too much: push back */
        char *copy = strdup(yytext);
        /* we should memcopy instead!!!! */
        unput(copy[yyleng - 1]);
        copy[yyleng - 1] = '\0';
        timestoread = 1;
        timesread = 0;
        yylval.yt_charstring = copy; /* no strdup needed */
        return T_DOT;
    } else {
        /* we have read the right amount */
        timestoread = 1;
        timesread = 0;
        yylval.yt_charstring = strdup(yytext);
        return T_DOT;
    }
}

<S_nl>\n {
    yylval.yt_charstring = strdup(yytext);
    return T_NL;
}

%%

/* for parser to tell the scanner to stop */
void scannedlast()
{ ready = 1; }

/* called by the parser with the wanted length; the body is not shown
   in this post, presumably it just sets timestoread */
void scancount(int n)
{ timestoread = n; }

void scanbegin_pattern()
{ BEGIN S_PATTERN; }

void scanbegin_dot()
{ BEGIN S_dot; }

void scanbegin_nl()
{ BEGIN S_nl; }
... and mapped on yacc (actually: bison) rules like the following:


/* declare the scanbegin_... routines that we need */

%type <yt_whatever_you_need> pattern_continuation
%token <yt_charstring> T_GRID
%token <yt_charstring> T_DOT
%token <yt_charstring> T_NL

/* here we need a rule that says that we want to parse 0 (1?) or
      more patterns (each 'pattern' parses a line);
      we use nonterminal 'pattern_continuation' to be able to have
      an action before the first token, even when there is more than
      one production rule for 'pattern_continuation'. */

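/* The rule layout below follows the $n numbering of the actions;
   apart from scancount(8 - strlen(grid)), the scanner calls are
   inferred from the description above. */

pattern
: { scanbegin_pattern(); }
  pattern_continuation
;

pattern_continuation
: T_GRID
  {
    char *grid = $1;
    scanbegin_dot();
    scancount(8 - strlen(grid));
  }
  T_DOT
  {
    char *grid = $1;
    char *f1 = $3;
    scanbegin_dot();
    scancount(8);
  }
  ... /* more of the same omitted */
  T_DOT
  {
    char *grid = $1;
    char *f1 = $3;
    char *f2 = $5;
    char *f3 = $7;
    char *f4 = $9;
    char *f5 = $11;
    char *f6 = $13;
    char *f7 = $15;
    scanbegin_nl();
  }
  T_NL
  {
    char *grid = $1;
    char *f1 = $3;
    char *f2 = $5;
    char *f3 = $7;
    char *f4 = $9;
    char *f5 = $11;
    char *f6 = $13;
    char *f7 = $15;
    char *nl = $17;
    $$ = SomeRoutineOnTheFields(....);
  }
;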

Hope this helps,
    University of Twente, Dept. of C.S., Formal Methods & Tools Group
  P.O. Box 217; NL-7500 AE Enschede. Phone: +31 53 4893774; Fax: +31 53 4893247

