Wed, 2 Sep 2009 08:45:22 -0500

Related articles |
---|

Deterministic Finite Automata: Sub-element captures alexander.morou@gmail.com (Alexander Morou) (2009-09-02) |

Re: Deterministic Finite Automata: Sub-element captures rsc@swtch.com (Russ Cox) (2009-09-02) |

Re: Deterministic Finite Automata: Sub-element captures Danny.Dube@ift.ulaval.ca (2009-09-12) |

From: | Alexander Morou <alexander.morou@gmail.com> |

Newsgroups: | comp.compilers |

Date: | Wed, 2 Sep 2009 08:45:22 -0500 |

Organization: | Compilers Central |

Keywords: | DFA, theory, question |

Posted-Date: | 02 Sep 2009 23:39:46 EDT |

Greetings,

I'm curious about regular expression-styled deterministic state

machines, and the groundwork that would be necessary for obtaining

the explicit captures from the regular expression.

Originally, in the project I'm working on, the tokens came in two

varieties: Captures and Enumerations.

Captures were simply a literal capture of the characters that made up

the string that matched; they were simple recognizers.

Enumerators were a simple series of named literals which were useful

for categorizing multiple simple tokens under a common name (example:

keywords); enumerators were transducer type state machines.

I finally produced a nearly complete lexer and integrated the stream

state machine, and realized that I had a transducer state machine of

the language (and could therefore verify if a string was at least

valid), but I needed to properly work out how the functional aspects

of a capturing state machine worked before I could build the final

parse graph of a pass over a string from that language.

Let's say you have a token called number:

Number :=

"0x" [0-9A-Fa-f]:WholePart;+

(

@'U':Unsigned; @'L':Long;? |

@'L':Long; @'U':Unsigned;?

):DataType;? |

[0-9]:WholePart;+

(

(

@'U':Unsigned; @'L':Long;? |

@'L':Long; @'U':Unsigned;? |

@'D':Double; |

@'F':Single; |

@'M':Decimal;

):DataType; |

'.' [0-9]:FloatingPart;+

(

@'e'

(

'+':Positive; |

'-':Negative;

):Direction;?

[0-9]+

):ePart;?

(

@'D':Double; |

@'F':Single; |

@'M':Decimal;

):DataType;?

)?;

Where all literals are in single or double quotes, '@' before a

literal means case insensitivity. When an alternation is used, if

two values on either side of the alternation are named the same, they

carry the same identity. For explicit captures within a DFA system,

would the mechanism operate via denoting the start and end of a given

capture's identity; setting/clearing the associated flags to the

capture as its start and end states are entered? Would repetition on

groups increase the rank of all elements, such that were there:

('a':ACapture;+ '.')*

'ACapture' would carry a rank of two (or one, if the desired capture

type for 'ACapture' is a string vs. char). Greater ranks determine

whether the item's capture is cleared upon entering the states or

whether an element is simply added to the list upon entering its exit

state.

The reason I'm curious about such things is the project I'm working

on doesn't have blocks defined which state what to do when a capture

is encountered, so there's a certain level of work I have to do which

I might not otherwise have to. I figure that recognizing 'what' and

'when' is important even if there were match code associated to the

expressions.

I'm also in the process of rewriting the lexer state machine builder,

so while I'm solving this problem, I want to solve it in both places

instead of just one.

Am I even going in the right direction if I'm interested in capturing

the sub-elements of a token or production using a deterministic state

machine?

Thanks,

Alexander [Allen C. Copeland Jr.] Morou

PS: Is Regular Language more appropriate to use than Regular

Expression?

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.