Re: Speeding up LEX scanning times

eifrig@beanworld.cs.jhu.edu (Jonathan Eifrig)
Tue, 7 Feb 1995 16:49:19 GMT

Newsgroups: comp.compilers
From: eifrig@beanworld.cs.jhu.edu (Jonathan Eifrig)
Keywords: parse, lex
Organization: The Johns Hopkins University CS Department
References: 95-02-041 95-02-057
Date: Tue, 7 Feb 1995 16:49:19 GMT

Virendra K. Mehta <c1veeru@WATSON.IBM.COM> wrote:
> COBOL lexical analyzers can be pretty slow because of the very complicated
> regular expressions involved.... These generate a sizable dfa and can slow
> the scanner down a lot.


Vern Paxson <vern@daffy.ee.lbl.gov> wrote:
>This is a common misconception (because it's so intuitive) regarding
>regular expression matching when using DFA's. A key point when using DFA's
>is that the run time is *independent* of the complexity of the RE. It's
>instead just linear in the number of characters scanned, since the DFA
>makes exactly one state transition per character, and figuring out which
>transition to make takes constant time.


Well, this is a good example of Moving Theory Into Practice
Axiom Number 1: "Not all constants are created equal."


The problem is that traversing a given arc in a DFA a *second* time
can be significantly faster than the first time, due to caching effects:
reads that hit the cache are typically 2-3 times faster than reads that
go all the way to main memory. DFA acceptors are usually coded up as:


state = START_STATE;
while (!final_states[state]) {
        token = input();                   /* fetch the next input character */
        state = next_state[state][token];  /* one table lookup per character */
}


(In practice, the "next_state" array is usually distributed as the jump
tables resulting from a bunch of "switch" statements, one for each state,
but the effect is the same.)
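
To make that concrete, here is a minimal sketch of the "one switch per
state" encoding. The two-state automaton (a run of digits), the state
names, and the function are made up for illustration and are not taken
from any lex or flex output, but the shape is the same: each state is
one arm of the outer switch, each arm switches again on the input
character, and the compiler turns those inner switches into jump tables,
so the transition "table" ends up scattered through the code.

#include <stdio.h>

enum state { START, IN_NUMBER, DONE };

/* Hypothetical direct-coded recognizer for a leading run of digits. */
static int scan_digits(const char *p)
{
        enum state s = START;
        int len = 0;

        while (s != DONE) {
                int c = *p++;
                switch (s) {                       /* outer switch: one arm per state */
                case START:
                        switch (c) {               /* inner switch: jump table on c   */
                        case '0': case '1': case '2': case '3': case '4':
                        case '5': case '6': case '7': case '8': case '9':
                                s = IN_NUMBER;
                                len++;
                                break;
                        default:
                                s = DONE;          /* non-digit (or '\0') ends the run */
                                break;
                        }
                        break;
                case IN_NUMBER:
                        switch (c) {
                        case '0': case '1': case '2': case '3': case '4':
                        case '5': case '6': case '7': case '8': case '9':
                                len++;
                                break;
                        default:
                                s = DONE;
                                break;
                        }
                        break;
                default:
                        s = DONE;
                        break;
                }
        }
        return len;
}

int main(void)
{
        printf("%d\n", scan_digits("12345abc"));   /* prints 5 */
        return 0;
}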


Reading from the "next_state" array accounts for a significant part of
the time taken on each pass through the loop. When the rest of the loop
body is cheap, a cache miss on that one read can have a dramatic effect
on scanning speed.
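
A back-of-the-envelope footprint calculation shows why those misses
happen. The figures below (500 states, 2-byte table entries, an 8 KB
first-level cache) are assumptions chosen for illustration, not
measurements of any particular scanner:

#include <stdio.h>

int main(void)
{
        /* All three figures are assumed, for illustration only. */
        long states   = 500;          /* a large, COBOL-ish DFA        */
        long alphabet = 256;          /* one column per input byte     */
        long entry    = 2;            /* 2-byte next-state entries     */
        long cache    = 8L * 1024;    /* small first-level data cache  */

        long table = states * alphabet * entry;
        printf("transition table: %ld bytes, cache: %ld bytes\n", table, cache);
        /* 500 * 256 * 2 = 256000 bytes: the table is far larger than
         * the cache, so the first traversal of most arcs will miss.  */
        return 0;
}

Once the hot subset of states has been pulled into the cache, the same
transition costs much less, which is exactly the "not all constants are
created equal" point above.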


>So basically do as much scanning as possible using RE's: with a proper
>DFA implementation they're very fast.


But sometimes other techniques are faster. :-)


Jack Eifrig (eifrig@cs.jhu.edu) The Johns Hopkins University, C.S. Dept.