Deterministic Finite Automata: Sub-element captures

Alexander Morou <alexander.morou@gmail.com>
Wed, 2 Sep 2009 08:45:22 -0500

          From comp.compilers

Related articles
Deterministic Finite Automata: Sub-element captures alexander.morou@gmail.com (Alexander Morou) (2009-09-02)
Re: Deterministic Finite Automata: Sub-element captures rsc@swtch.com (Russ Cox) (2009-09-02)
Re: Deterministic Finite Automata: Sub-element captures Danny.Dube@ift.ulaval.ca (2009-09-12)
| List of all articles for this month |
From: Alexander Morou <alexander.morou@gmail.com>
Newsgroups: comp.compilers
Date: Wed, 2 Sep 2009 08:45:22 -0500
Organization: Compilers Central
Keywords: DFA, theory, question
Posted-Date: 02 Sep 2009 23:39:46 EDT

Greetings,


I'm curious about regular expression-styled deterministic state
machines, and the groundwork that would be necessary for obtaining
the explicit captures from the regular expression.


Originally, in the project I'm working on, the tokens came in two
varieties: Captures and Enumerations.


Captures were simply a literal capture of the characters that made up
the string that matched; they were simple recognizers.


Enumerators were a simple series of named literals which were useful
for categorizing multiple simple tokens under a common name (example:
keywords); enumerators were transducer type state machines.


I finally produced a nearly complete lexer and integrated the stream
state machine, and realized that I had a transducer state machine of
the language (and could therefore verify if a string was at least
valid), but I needed to properly work out how the functional aspects
of a capturing state machine worked before I could build the final
parse graph of a pass over a string from that language.


Let's say you have a token called number:
Number :=
        "0x" [0-9A-Fa-f]:WholePart;+
                (
                        @'U':Unsigned; @'L':Long;? |
                        @'L':Long; @'U':Unsigned;?
                ):DataType;? |
        [0-9]:WholePart;+
        (
                (
                        @'U':Unsigned; @'L':Long;? |
                        @'L':Long; @'U':Unsigned;? |
                        @'D':Double; |
                        @'F':Single; |
                        @'M':Decimal;
                ):DataType; |
                '.' [0-9]:FloatingPart;+
                (
                        @'e'
                        (
                                '+':Positive; |
                                '-':Negative;
                        ):Direction;?
                        [0-9]+
                ):ePart;?
                (
                        @'D':Double; |
                        @'F':Single; |
                        @'M':Decimal;
                ):DataType;?
        )?;


Where all literals are in single or double quotes, '@' before a
literal means case insensitivity. When an alternation is used, if
two values on either side of the alternation are named the same, they
carry the same identity. For explicit captures within a DFA system,
would the mechanism operate via denoting the start and end of a given
capture's identity; setting/clearing the associated flags to the
capture as its start and end states are entered? Would repetition on
groups increase the rank of all elements, such that were there:


        ('a':ACapture;+ '.')*


'ACapture' would carry a rank of two (or one, if the desired capture
type for 'ACapture' is a string vs. char). Greater ranks determine
whether the item's capture is cleared upon entering the states or
whether an element is simply added to the list upon entering its exit
state.


The reason I'm curious about such things is the project I'm working
on doesn't have blocks defined which state what to do when a capture
is encountered, so there's a certain level of work I have to do which
I might not otherwise have to. I figure that recognizing 'what' and
'when' is important even if there were match code associated to the
expressions.


I'm also in the process of rewriting the lexer state machine builder,
so while I'm solving this problem, I want to solve it in both places
instead of just one.


Am I even going in the right direction if I'm interested in capturing
the sub-elements of a token or production using a deterministic state
machine?


Thanks,


Alexander [Allen C. Copeland Jr.] Morou


PS: Is Regular Language more appropriate to use than Regular
Expression?


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.