Speed of Lex tokenizers

decvax!boulder!scotty!waite (William Waite)
Sun, 17 May 87 17:26:14 MDT


Date: Sun, 17 May 87 17:26:14 MDT
From: decvax!boulder!scotty!waite (William Waite)

A paper entitled "Tuning UNIX Lex" was presented at the winter Usenix
Conference by Van Jacobson of the Lawrence Berkeley Laboratory. The paper
itself was not published in the conference proceedings, but the abstract
stated that a factor of ten improvement in the performance of a
LEX-generated scanner was attained. The abstract also claimed that when the
generated scanner was "tested against the hand-coded scanners described in
[1] and [2], the new Lex was always faster". Reference 2 was to my paper in
Software-Practice and Experience (May, 1986, 473-488), so I was quite
interested in this claim.


I was unable to obtain a copy of the paper itself, so I decided to conduct
another experiment in an effort to understand the discrepancy. Instead of
attempting to modify LEX, I played its role myself: I took the
specification for Pascal basic symbols and carefully constructed a scanner
by hand, following the general finite-state model. I then coded this
scanner in C and compared its performance with that of the scanner
described in the SP&E paper.


Each processing program was applied to the distributed version of the
Pascal P4 compiler, a large (4338 lines, 166002 characters) Pascal program.
About half (86971 characters) of the program consists of comments and white
space, 26% is identifiers and keywords, and 3% appears in integers and
string constants. There are no real numbers. Additional details can be
found in the SP&E paper.


All tests were made on a Sun 3/75 running version 3.2 of the Unix
operating system. There were no other users on this machine at the time the
tests were run. It is a diskless workstation, and no attempt was made to
control for other users on the network. The getrusage(2) system call was
invoked after all initialization had taken place, and again after the
program's main loop had completed. The difference between those
measurements is reported
in units of seconds. Each value is the mean of ten trials, and an unbiased
estimate of the population standard deviation is provided.
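
(A minimal C sketch of the measurement scheme follows. The function name is
mine, and the restriction to user CPU time is an assumption; the text above
says only that getrusage(2) was called before and after the loop.)

#include <sys/time.h>
#include <sys/resource.h>

/* Return the user CPU time consumed by this process so far, in seconds. */
static double
user_seconds(void)
{
    struct rusage ru;

    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1.0e6;
}

/* Typical use:
 *     t0 = user_seconds();                  after all initialization
 *     ... run the scanner over the entire input ...
 *     elapsed = user_seconds() - t0;        value reported in the table
 */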


                                10-trial mean   Standard dev.

Scanner from SP&E paper:          0.946 sec        0.041
Hand-coded LEX scanner:           1.398 sec        0.061


Analysis of variance showed that the differences between the two scanners
were statistically significant at the 99.9% level. Both of the algorithms
used the input routine described in the SP&E paper; for comparison, the
time to read the same data using the Unix library routine getc was 0.792
seconds.


My results are exactly the opposite of those reported by Jacobson, and I am
at a loss to explain his. The remainder of this note analyzes the code used
in the two scanners, and shows that the difference stems from fundamental
differences in the way the algorithms are structured, rather than from any
table optimization that one might be able to perform.


About 88% of the characters in P4 are spaces, identifiers, keywords and
comments. The scanner described in the SP&E paper handles all of these
constructs with loops that generate assembly code of the following form
(this code was extracted from the hand-optimized assembly code file used to
produce the measurements in the second column):


L68:
movb a4@+,d0 Get an input character
extw d0 Make it a suitable index
btst #2,a5@(0,d0:w) Check whether it is allowed
jne L68 If so, continue


The mask (here "#2") defines a particular "continuation set" as discussed
in the SP&E paper. A maximum of 8 such sets are stored in a
single 128-byte array.
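
In C, the test in this loop amounts to one indexed load and one bit test per
character. A minimal sketch of the idea (my reconstruction; the names are
illustrative and not taken from the SP&E code, and bit 2 here corresponds to
the "btst #2" above):

/* One membership bit per continuation set, so the single 128-byte array
 * can describe up to 8 sets.  The input is assumed to be 7-bit ASCII,
 * as the 128-byte array above implies. */
static unsigned char classes[128];

static char *
scan_run(char *p, int set)
{
    while (classes[(unsigned char)*p] & (1 << set))
        p++;                    /* stay while the character is in the set */
    return p;                   /* first character outside the set */
}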


The same characters are handled in the general finite-state machine by the
following sequence:


L40:
movb a5@+,d0 Get an input character
extw d0 Make it a suitable index
movl a4@(0,d0:w:4),a0 Get the state table column
movb a0@(0,d7:l),d7 Get the next state
jgt L40 Continue if nothing special
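
The corresponding C costs one extra memory reference per character, because
the next state must be fetched through the character's column pointer. A
sketch under the same caveats (my reconstruction, illustrative names):

/* column[c] points at an array of next states indexed by the current
 * state; characters of the same class share a column.  A state <= 0
 * stops the loop, matching the "jgt" above. */
static signed char *column[128];

static char *
scan_fsm(char *p, int state)
{
    do {
        state = column[(unsigned char)*p++][state];   /* next state */
    } while (state > 0);
    return p;                   /* one past the stopping character */
}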


I used the same strategy for keyword recognition as is used in GLA:
Keywords are recognized as identifiers and then their identity is
determined from an identifier table lookup. This results in a huge
reduction in the size of the state table (from about 165 states to about 37
states). I also retained the "filtering" on the first character that is
provided by the GLA-generated analyzer, in order to reduce the size of the
tables further. The final state table had 18 states and 15 character
classes. Each character class was represented by an array of 18 bytes, and
that array was addressed by a pointer in the character class table. There
was also an auxiliary table for special actions such as setting a backtrack
point (consider the backtracking necessary when recognizing "1..10"), but
this table was not accessed in the algorithm's main loop. Given the small
amount of space the character class arrays occupy (270 bytes), there is
clearly no point in incurring the extra cost of overlaying them, as LEX
does.
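
Declared in C, the table layout just described might look like the following
(the sizes are those quoted above; the names are mine):

#define NSTATES   18            /* rows: one per scanner state      */
#define NCLASSES  15            /* columns: one per character class */

static signed char next_state[NCLASSES][NSTATES];  /* 15 * 18 = 270 bytes */
static signed char *char_class[128];   /* maps each character to its column */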


The ratio of overall execution times of the SP&E and LEX versions is close
to the ratio of the instruction times of these two code bodies. Unless the
table structure is changed markedly, there seems to be no way of closing
the gap. The SP&E loop is shorter and faster because it makes a different
test: it asks "should we stay in the same state?", whereas the general
machine says "we need to go into state i next". This latter statement
machine says "we need to go into state i next". This latter statement
conveys more information, and hence it can be used in a larger number of
cases. Unfortunately, the common cases in programming languages seem to
require only the former. Of course, if the statistics of an application are
vastly different from those quoted here then the results may be quite
different.


My group would appreciate comments on these measurements. In particular, we
are interested in concrete data that supports different conclusions about
the nature of the lexical analysis process. Please reply to
"compiler@boulder" or "compiler@colorado.edu".


[Did you see what the effect was of treating keywords as identifiers? I'd
think that Pascal programs have a rather high density of keywords, and
special case code to deal with them might pay off. -John]