How to make a multi-accept DFA for merged regexps?


From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Date: Mon, 23 Dec 2019 05:40:43 -0500
Organization: Compilers Central
Keywords: DFA, parse
Posted-Date: 23 Dec 2019 18:29:13 EST

Andy <borucki.andrzej@gmail.com> wrote:


> Greedy algorithms match the longest regexp: for example, the operators "+"
> and "++", or the int number "123" versus the float number "123.456e3".
> At '.' we leave the accepting state for an int, but we are inside the
> automaton for a float. And there can be errors: the '.' may be followed by
> 'a'. Must we backtrack to the last accepting state? I want to avoid
> backtracking. Maybe after backtracking we must read characters from an
> auxiliary token buffer instead of the stream, up to the previous position?
> But this complicates parsing.


1) Backtracking is one solution.


A character buffer (what you called a token buffer), rather than reading
the raw input, can be used with or without backtracking. A character
buffer also tends to be efficient, because you call the I/O routines
only infrequently. Most lexers (flex, for example) use character buffers
to speed up their lexing significantly. The only problem arises if you
mix lexing with reading from the input stream directly, but that is a
bad idea in itself.
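

To make the buffered-backtracking idea concrete, here is a minimal C
sketch, assuming a tiny hand-coded DFA for ints and floats. The state
names, dfa_step, and next_token are all hypothetical, not the interface
of any real lexer. The scanner remembers the offset of the last
accepting state, and "backtracking" is simply resetting its index into
the buffer, with no re-reading from the stream:

/* Hypothetical maximal-munch scanner over a character buffer. */
#include <stdio.h>

enum { S_ERR = -1, S_START, S_INT, S_DOT, S_FLOAT };  /* DFA states */

static int is_accepting(int s) { return s == S_INT || s == S_FLOAT; }

static int dfa_step(int s, char c) {
    switch (s) {
    case S_START: return (c >= '0' && c <= '9') ? S_INT : S_ERR;
    case S_INT:   return (c >= '0' && c <= '9') ? S_INT
                       : (c == '.')             ? S_DOT : S_ERR;
    case S_DOT:   return (c >= '0' && c <= '9') ? S_FLOAT : S_ERR;
    case S_FLOAT: return (c >= '0' && c <= '9') ? S_FLOAT : S_ERR;
    default:      return S_ERR;
    }
}

/* Scan one token starting at buf[*pos].  Records the most recent
   accepting position; on reaching a dead state, rewinds *pos to it
   instead of re-reading the input stream. */
static int next_token(const char *buf, size_t *pos, int *kind) {
    int state = S_START, last_kind = S_ERR;
    size_t i = *pos, last_accept = *pos;
    while (buf[i] != '\0') {
        state = dfa_step(state, buf[i]);
        if (state == S_ERR) break;
        i++;
        if (is_accepting(state)) { last_accept = i; last_kind = state; }
    }
    if (last_kind == S_ERR) return 0;   /* no token recognized */
    *pos = last_accept;                 /* "backtrack" = reset the index */
    *kind = last_kind;
    return 1;
}

int main(void) {
    const char *buf = "123.a";  /* '.' is consumed tentatively, then given back */
    size_t pos = 0;
    int kind;
    while (next_token(buf, &pos, &kind))
        printf("token kind %d ends at offset %zu\n", kind, pos);
    return 0;
}

On "123.a" this returns the int "123" and stops: the '.' was consumed
tentatively while looking for a float, and giving it back costs only an
index assignment because the characters are still in the buffer.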


2) Yet another solution is to find the cases that cause errors and
write rules for them that return multiple tokens. E.g., for the case
“123.a”, make a rule for “123.” that returns two tokens: “123” as an
int and “.” as a dot (presuming the dot is a token in its own right).
This works well as long as there aren’t infinite strings of characters
that toggle back and forth as to what is legal, so that you can make a
finite set of rules that disambiguate what you are lexing.
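

Here is a hedged C sketch of that trick, again with made-up token
names; real float recognition is omitted to keep it small. When the
scanner sees the problematic “123.” prefix, it returns the int
immediately and queues the dot in a one-slot pending queue for the
next call:

#include <ctype.h>
#include <stdio.h>

enum tok { T_EOF, T_INT, T_DOT, T_IDENT };

static enum tok pending = T_EOF;   /* one-slot queue of split-off tokens */

static enum tok next_token(const char **p) {
    if (pending != T_EOF) { enum tok t = pending; pending = T_EOF; return t; }
    const char *s = *p;
    if (*s == '\0') return T_EOF;
    if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s)) s++;
        if (*s == '.' && !isdigit((unsigned char)s[1])) {
            s++;                   /* matched "123." with no digit after */
            pending = T_DOT;       /* split: return INT now, DOT next time */
        }
        *p = s;
        return T_INT;
    }
    if (*s == '.') { *p = s + 1; return T_DOT; }
    if (isalpha((unsigned char)*s)) {
        while (isalpha((unsigned char)*s)) s++;
        *p = s;
        return T_IDENT;
    }
    return T_EOF;                  /* anything else ends this sketch */
}

int main(void) {
    const char *src = "123.a";
    enum tok t;
    while ((t = next_token(&src)) != T_EOF)
        printf("%d\n", t);         /* prints T_INT, T_DOT, T_IDENT */
    return 0;
}

The parser never sees the bad “123.” prefix as one unit; it sees the
finite set of splits you decided on in advance, with no backtracking
at lexing time.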


This is not an easy problem to solve in general. In a fictitious
language with tokens “a”, “ab”, “b”, and “ba”, the longest-match rule
would lex the string “aba” as “ab” “a”, but “a” “ba” is an equally
plausible pair of tokens. It is possible that only the parser will
know which you wanted, or even worse, the string can be ambiguous even
at the parsing level. Generally, people give up on solving the problem.
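

A brute-force illustration of that ambiguity, assuming a hypothetical
toks[] table holding the four token spellings: enumerating every split
of “aba” turns up both two-token readings, plus a three-token one.

#include <stdio.h>
#include <string.h>

static const char *toks[] = { "a", "ab", "b", "ba" };

/* Recursively try every token that is a prefix of what remains. */
static void split(const char *s, const char *out) {
    if (*s == '\0') { printf("%s\n", out); return; }
    for (int i = 0; i < 4; i++) {
        size_t n = strlen(toks[i]);
        if (strncmp(s, toks[i], n) == 0) {
            char buf[64];
            snprintf(buf, sizeof buf, "%s%s\"%s\"", out, *out ? " " : "", toks[i]);
            split(s + n, buf);
        }
    }
}

int main(void) {
    split("aba", "");   /* prints "a" "b" "a", "a" "ba", and "ab" "a" */
    return 0;
}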


Moreover, in practice, because such language fragments are
problematic, people simply apply the left-most-longest rule and
declare things like “123.a” to be an error. That is actually a pretty
rare problem, because the token sequence “123” “.” “a”, as three
tokens, usually isn’t legal to the parser anyway.


3) Finally, you can define your tokens to be only unique substrings
and put them together in the parser. E.g., “123” is always an “int”
token, “.” is always a “dot” token, and there is no “float” token,
just a float non-terminal that is “int dot int etc.”. That is not a
panacea either, especially if your lexer does things like throwing
away whitespace and comments. On the other hand, I have seen languages
that allowed whitespace (and even, occasionally, comments) inside of
tokens. That has more to do with your language specification/design.
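

A sketch of the parser side of this option, assuming hypothetical
tokens that carry byte offsets so the parser can insist the int, dot,
and int are adjacent (i.e., no whitespace was discarded between them):

#include <stdio.h>

enum { T_INT = 1, T_DOT };

struct token {
    int kind;        /* T_INT or T_DOT */
    long start, end; /* byte offsets, so adjacency can be checked */
};

/* float_value : INT DOT INT   -- accepted only if the tokens touch */
static int parse_float(const struct token *t, int i, int n) {
    return i + 2 < n
        && t[i].kind   == T_INT
        && t[i+1].kind == T_DOT && t[i+1].start == t[i].end
        && t[i+2].kind == T_INT && t[i+2].start == t[i+1].end;
}

int main(void) {
    /* "123.456" lexed as three adjacent tokens */
    struct token ts[] = { {T_INT,0,3}, {T_DOT,3,4}, {T_INT,4,7} };
    printf("float? %d\n", parse_float(ts, 0, 3));  /* prints 1 */
    return 0;
}

Carrying offsets is one way to keep the whitespace caveat manageable:
the grammar rule still reads “int dot int”, but the adjacency check
rejects “123 . 456” without the lexer having to know about floats.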


--
*****************************************************************************
Chris Clark                email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
23 Bailey Rd               voice: (508) 435-5016
Berlin, MA 01503 USA       twitter: @intel_chris

