Re: Independent Study - Assistance?

Chris F Clark <cfc@shell01.TheWorld.com>
Fri, 27 Feb 2009 01:44:33 -0500

          From comp.compilers

Related articles
Independent Study - Assistance? Alex@alexandermorou.com (Alexander Morou) (2009-02-15)
Re: Independent Study - Assistance? cfc@shell01.TheWorld.com (Chris F Clark) (2009-02-16)
Re: Independent Study - Assistance? Alex@alexandermorou.com (Alexander Morou) (2009-02-21)
Re: Independent Study - Assistance? cfc@shell01.TheWorld.com (Chris F Clark) (2009-02-27)
Re: Independent Study - Assistance? alexander.morou@gmail.com (Alexander Morou) (2009-02-28)
| List of all articles for this month |
From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: Fri, 27 Feb 2009 01:44:33 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: <e45f4adf0902152017v2f7ce1c7g38ae34f7d42bc1b0@mail.gmail.com> 09-02-106
Keywords: lex, parse
Posted-Date: 27 Feb 2009 07:31:58 EST

Alexander Morou <Alex@alexandermorou.com> writes:


> Since I prefer code that is readable, as opposed to sets of numbers
> that leave me scratching my head, so it emits a simple state machine
> for the token. I hope the list maintainer doesn't mind, but here's an
> example of output:


Your code is ok, nothing surprisingly good or bad about it, at least
in terms of your stated goals. In fact, it looks a lot like how
certain verilog designers encode state machines in their logic.


> I don't have the experience necessary to create a regular expression
> that can show the system fault; however, I can say what it is not: It
> isn't a regex parser generator, because it doesn't have look-behinds
> or look-aheads, and notably by the source above: it doesn't do any
> capturing (yet).


Ok, here you need some theoretical gaps covered.


The term "regex parser generator" is a mixture of several concepts.


      Regex as you appear to be using the term is not a concept with a
      well-founded theoretical basis (currently, although there are some
      researchers who are getting such a basis in place)--it is more an
      ad hoc construction used to describe regular expressions exteneded
      with captures and other elements. Your usage of the term seems to
      correspond to Perl (or PCRE) regexes.


      Now, if you were talking about regular expressions (a well-founded
      theory), the term generally used is a lexer generator rather than a
      parser generator. However, your subsequent usage (in this and your
      following post) seems to indicate that you do mean what is
      traditionally is termed parsing (the assigning of input sentences
      to the corrsponding derivation (parse) trees).


      To be more clear about this. Regular expressions are generally
      used in lexing--the mapping of inputs to tokens. The structure of
      the regular expression is not considered, i.e. we don't care how
      the characters in the input are matched to the various parts of a
      regular expression, only which token the string matches.


      The regex extensions (captures, assertions, back references, et al)
      are not used in lexing because they require keeping of the matches
      within a token. This breaks the theoretical equivalences that make
      a DFA and an NFA ingterchageable among other things, and potentially
      require a Turing Machine to match.


      In parsing, one doesn't traditionally use regex extensions either.
      The language of BNF or EBNF (= BNF + regular expressions) is
      generally sufficient to describe programming languages (or
      significant subsets thereof, with the missing parts covered by
      "hacks"). Note EBNF is not more powerful than BNF, simply more
      succinct. The main forms of parsing used with BNF are the LL and
      LR families of grammars. Both of these have linear time parsing
      methods. Recently, these methods have begun to be supplanted by
      extensions such as predicated grammars, PEGs, and GLR parsing,
      which admit a wider class of languages, but have linear time for
      the common cases. Again, the regex extensions do not have
      guaranteed linear time implementations and thus are not used. It's
      not that one couldn't use them, but the theory isn't currently
      sufficiently developed to do so without significant "risk".


      In your follow on question, where you are doing minimization, you
      cannot merge states when the states generate (or lead to the
      generation of) different outputs. If you omit that check, you will
      get results similar to what you describe.


I know you are anxious to plow ahead, but your doing this without the
underlying theory is causing you to make mistakes that could easily be
avoided. Now, it may be possible that those mistakes may result in
significant innovations, but the more likely case is that you will
produce something that looks nice, but is inherently flawed. It would
only take a couple of courses to rectify that problem, as I noted
earlier. Until you do, the best I can do is point out your most
egregious mistakes and leave the subtle flaws un-discussed as you
don't have the background to understand why they are errors.


I'm sorry,
-Chris


******************************************************************************
Chris Clark Internet: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. or: compres@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.