Parallel lexers by chunking the input

Christopher F Clark <christopher.f.clark@compiler-resources.com>
Thu, 24 Mar 2022 22:32:12 +0200

          From comp.compilers

Related articles
Parallel lexers by chunking the input christopher.f.clark@compiler-resources.com (Christopher F Clark) (2022-03-24)
Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (2022-03-25)
Re: Parallel lexers by chunking the input gah4@u.washington.edu (gah4) (2022-03-25)
Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (2022-03-25)
Parallel lexers by chunking the input. christopher.f.clark@compiler-resources.com (Christopher F Clark) (2022-03-26)
Parallel lexers by chunking the input christopher.f.clark@compiler-resources.com (Christopher F Clark) (2022-03-26)
Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (2022-03-26)
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Date: Thu, 24 Mar 2022 22:32:12 +0200
Organization: Compilers Central
References: 22-03-058
Keywords: lex, parallel
Posted-Date: 24 Mar 2022 19:20:45 EDT

Under certain very limited conditions you can parallelize your lexer
by chunking the input, but the number of cases in which that breaks
makes it a rather useless endeavor. At Intel we had Charlie Lasswell
and Michela Becchi look into the problem from two different angles.
Michela has some excellent papers about dealing with ".*" and related
constructs, especially overlapping ".{m,n}" constructs.


But the real issue is strings, comments, and markdown-like features.
They all suffer from the same issue, which is that they are generally
unbounded strings that start and stop with very short sequences. And,
by picking some arbitrary point further along in the input to start
scanning, you cannot be sure whether you are inside or outside one of
those features. This, of course, also makes for very problematic error
messages. Leave out a single character (e.g. an opening or closing
quote) and the text can be entirely misinterpreted.
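
To make that concrete, here is a minimal sketch using a hypothetical
toy grammar (double-quoted single-line strings, '#' line comments,
identifiers, single-character operators) of how a scan that happens to
begin inside a string literal misreads everything that follows; none
of the names here come from any real lexer:

```
import re

TOKEN = re.compile(r'"[^"\n]*"'      # string literal (single line)
                   r'|#[^\n]*'       # line comment
                   r'|[A-Za-z_]\w*'  # identifier
                   r'|\S')           # any other single character

def lex(text):
    # Naive lexer that assumes it starts *outside* any string or comment.
    return TOKEN.findall(text)

source = 'x = "a # not a comment" + y  # real comment\n'

print(lex(source))
# ['x', '=', '"a # not a comment"', '+', 'y', '# real comment']

# Now pretend a parallel worker's chunk begins here, inside the string:
boundary = source.index('#')
print(lex(source[boundary:]))
# ['# not a comment" + y  # real comment']
# The rest of the line, including real code, is misread as one comment,
# because the worker cannot know it started inside a string literal.
```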


One common attempt at mitigating that problem is limiting such
sequences to a single line. However, the result of that is the need
for "raw" strings, where you can have long strings that span multiple
lines.


Like


```
code goes in here
and can include `, ', and " marks where appropriate
and it doesn't end until one sees
```


or maybe one prefers


#" to delimit a raw string
      allowing embedded " and ' marks
      and maybe the processor throws away indentation within it
      and trailing whitespace at the end of a line
      but we are still looking for the terminating mark
      and ``` here isn't special either
"#


And, before you think that's a ridiculous example, let me tell you a
very real use of it. I'm building a compiler and, in particular, unit
tests for it. So I need strings that represent significant fragments
of real programs, and other strings, the "golden" results, that
capture the JSON output of the internal structure, which I cut and
paste from dumps of it as a string. Both of these fragments
(especially the latter) can be multiple lines long and can look like
code (the first actually are code), but I don't want them parsed as
such. A parallelized lexer could easily attempt to tokenize those
fragments if it picked a bad starting point. That speculative lexing
would actually make the process slower.
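
Hypothetically, such a test might look like the sketch below (Python
purely for illustration; the test harness, the compile_to_json
stand-in, the fragment, and the golden dump are all invented), with
both multi-line literals held as data that must stay opaque to the
host language's lexer:

```
import json
import unittest

def compile_to_json(source: str) -> str:
    # Placeholder for the compiler front end under test; its name and
    # signature are assumptions made for this illustration only.
    raise NotImplementedError

# A multi-line fragment of "real program" text, full of quotes, '#'
# marks, and ``` fences that must not be lexed as host-language code.
PROGRAM_FRAGMENT = r'''
fn main() {
    let s = "quotes, # marks, and ``` fences all live in here";
}
'''

# The "golden" dump of the expected internal structure, cut and pasted
# from the compiler's own JSON output.
GOLDEN_JSON = r'''
{"kind": "fn", "name": "main",
 "body": [{"kind": "let", "name": "s", "init": {"kind": "string"}}]}
'''

class GoldenOutputTest(unittest.TestCase):
    def test_fragment_matches_golden(self):
        got = compile_to_json(PROGRAM_FRAGMENT)
        self.assertEqual(json.loads(got), json.loads(GOLDEN_JSON))
```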


Preston Briggs was right when he talked about regular expressions
being an "embarrassingly serial application".


In fact, I have numerous times thought about applying Boyer-Moore-like
algorithms to lexing and parsing, and then, remembering these cases,
put the idea down as technically feasible but practically useless.
About the only way to use parallelism to speed up lexing is if one has
separate files being processed and rules which keep tokens from
changing state across a file boundary. Now, that last requirement
isn't as hard as it seems, as usually "include" files can only get
recognized at token boundaries. Still, the speedup therefrom is not
that useful. There are other, better solutions to that problem.
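
A minimal sketch of that file-level parallelism (reusing the toy token
pattern from the earlier sketch; the lex_file helper and the
file-per-task split are assumptions for illustration): each worker
lexes a complete file, so it always starts in the initial lexer state
and no token can straddle a work-unit boundary.

```
import re
import sys
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

TOKEN = re.compile(r'"[^"\n]*"|#[^\n]*|[A-Za-z_]\w*|\S')

def lex_file(path):
    # Each worker lexes one whole file from its beginning, so it never
    # has to guess whether it started inside a string or comment.
    return path, TOKEN.findall(Path(path).read_text())

def lex_many(paths):
    with ProcessPoolExecutor() as pool:
        return dict(pool.map(lex_file, paths))

if __name__ == "__main__":
    for path, tokens in lex_many(sys.argv[1:]).items():
        print(path, len(tokens), "tokens")
```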


--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

