Thu, 23 Feb 2023 00:34:29 -0000 (UTC)

Related articles |
---|

syntax complexity gah4@u.washington.edu (gah4) (2023-02-15) |

Re: syntax complexity DrDiettrich1@netscape.net (Hans-Peter Diettrich) (2023-02-16) |

Re: syntax complexity gah4@u.washington.edu (gah4) (2023-02-16) |

Re: syntax complexity costello@mitre.org (Roger L Costello) (2023-02-20) |

Re: syntax complexity gah4@u.washington.edu (gah4) (2023-02-20) |

Re: syntax complexity anton@mips.complang.tuwien.ac.at (2023-02-21) |

Re: irregular expressions, syntax complexity arnold@freefriends.org (2023-02-22) |

Re: irregular expressions, syntax complexity 864-117-4973@kylheku.com (Kaz Kylheku) (2023-02-23) |

From: | Kaz Kylheku <864-117-4973@kylheku.com> |

Newsgroups: | comp.compilers |

Date: | Thu, 23 Feb 2023 00:34:29 -0000 (UTC) |

Organization: | A noiseless patient Spider |

References: | <AQHZRT1Nf7Ln0mpyG0extuZP9rnVPQ==> 23-02-045 23-02-047 23-02-050 <29156_1676600565_63EEE4F4_29156_1009_1_23-02-051@comp.compilers> 23-02-052 23-02-053 23-02-055 |

Injection-Info: | gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="62536"; mail-complaints-to="abuse@iecc.com" |

Keywords: | lex |

Posted-Date: | 22 Feb 2023 19:48:27 EST |

On 2023-02-21, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

*> You can translate regexps with & to a DFA, and don't lose any*

*> performance there. For an NFA-based implementation, the only way I*

*> see is that you process both sub-NFAs and check if they both accept*

*> the string, so it's somewhat slower; I guess that this is a major*

*> reason why such an operator has not been added to widely-used regexp*

*> syntaxes.*

*>*

*> Has anybody implemented such an operator at all?*

I implemented & in the TXR language.

It's based on Brzozowski regex derivatives (described first in 1964,

IIRC).

Derivatives allow a regex syntax with & to be interpreted directly,

and without backtracking or trying sub-cases; effectively, the approach

is equivalent to an NFA.

(Derivatives can be used for compiling also. The problem you have to

solve is deciding when two regexes are equivalent, so that you can

close loops in the state machine graph. The better you can do this, the

tighter the graph will be,)

The idea behind a "derivative" is to take an input symbol S and a regex

R, and calculate a new expression R' which matches the rest of the input

after S.

E .g. the derivativre of "abc" with respect to "a" would be "bc".

Likewise with respect to "." (match any single character).

The derivative of "abc" with respect to d would be ∅, the nomatch regex

which describes/matches nothing: empty set of strings. (Not to be

confused with the empty regex that matches the empty string, denoting

the set { "" }).

To match a string of symbols, you simply take successive derivatives. If

you hit ∅, you can short-circuit to the conclusion that the input

doesn't match.

When you run out of input symbols, the remaining derivative must be

nullable: nullable means capable of matching the empty string. If it

isn't, it means the regex wants more characters, so you don't have

a match.

The empty set ∅ regex isn't nullable, needless to say; a regex is

nullable if the language (set of strings) contains the empty string.

Anyway, you can represent regexes in Lisp syntax and then the derivative

operation is just a functional syntax -> syntax calculation (which

generates a lot of garbage as we match strings).

The & operator is easily supported because the derivative operation

readily distributes into the & operatror according to an easy rule.

d(S, R1 & R2) -> d(S, R1) & d(S, R2)

Basically you take the derivatives of the branches, and combine them

with & again.

The more familiar | operator distributes the same way.

IN an actual implementation, there are tricks you can pull, like

short-circuit evaluation. If you take the d(S, R1) derivative of the

left side of R1&R2, and that turns out to be ∅, you can short-circuit

to ∅, If you have way to detect that a regular expression A matches all

strings (like .*) then you know that A&R is equiavlent to R; you can

drop that branch, similarly A|R is A.

You can see that this is amenable to compilation, similar to the subset

construction; you can calculate a derivative with respect to all

possible input symbols and build a graph in which the regex syntax

itself (all the derivatives) are the states. To close loops in the

graph, you need to detect states you have seen before, which requires

equivalence test between two syntax fragments. The fewer false negatives

you have, the tighter the graph.

TXR's regex engine has a front end which checks whether the syntax

contains the "exotic operators" and chooses either the derivative-based

back end or compilation to NFA and simulation.

I haven't bothered implementing better compilation for either back end;

if a collaborator came along to explore those possibilities, that would

be great. Most of that work was done over a decade ago. There is now a

powerful Lisp dialect that would make easy work of writing a

derivative-based regex compiler.

--

TXR Programming Language: http://nongnu.org/txr

Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Mastodon: @Kazinator@mstdn.ca

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.