Re: lexical analysis question (Drederick)
5 Apr 2003 15:40:53 -0500

          From comp.compilers

Related articles
lexical analysis question (2003-03-30)
Re: lexical analysis question (Mats Kindahl) (2003-03-30)
Re: lexical analysis question (Chris F Clark) (2003-03-30)
Re: lexical analysis question (2003-04-05)
| List of all articles for this month |

From: (Drederick)
Newsgroups: comp.compilers,comp.theory
Date: 5 Apr 2003 15:40:53 -0500
References: 03-03-178 03-03-191
Keywords: lex
Posted-Date: 05 Apr 2003 15:40:53 EST

Chris F Clark <> wrote in message news:03-03-191...
> I am hoping this is not a homework problem. It is just contrived
> enough to be a homework problem in an automata class and represent a
> special class of machines that I'm not aware of. However, it is also
> possible you are trying to solve an "interesting" problem.

Well, it is not a 'normal' homework problem. I asked my professor if
he would think of an interesting problem for me and this was the
result. I will probably end up doing some sort of project related to
it, unless I discover that it isn't interesting. (hence why my
original message was not soliciting solutions).

> Does it also mean that no string matching E[i] is a prefix of a string
> matching E[j] (and vice versa)? Is it possible for a prefix of string
> matching E[j] to match a suffix of a string matching E[i]?

The situation in your first sentence is possible. The algorithm always
matches the longest string possible which matches some regular
expression (starting from the beginning of the yet unmatched symbols).

> Next, can your stream of symbols contain errors, sequences of symbols
> that match no regular expression or is it error-free?

The sequence is error-free.

> If the regular expressions, while unambiguous, have prefixes that can
> match suffixes of other strings (an "overlapping ambiguity"), your
> constant bound of C becomes relevant in that one cannot distinuguish
> "overlap ambiguities" that are longer than C.


> E[0]: a b+;
> E[1]: b c;
> s = a b b c
> correct results: E[0]->a b, E[1]->b c
> longest match results: E[0]->a b b, error no match for "c"

OK. The algorithm I envision is supposed to greedily match E[0] and
then give the error.

> You can even find sets of regular expressions where 3 or more of the
> rules interact to determine the end of the first regular expression.
> E[0]: a b+;
> E[1]: b c;
> E[2]: c c d;
> s = a b b c c d => E[0]->a b b, E[2]->c c d
> s = a b b c c c d => E[0]->a b, E[1]->b c, E[2]->c c d
> In this case, you don't know when to match E[0] until you have seen
> not only a "c", but as many as 3 characters following its end to see
> if the sequence is "c c d".

Hmm, that general problem does look nasty.

> In case this is a homework problem, can you show that for any given
> set E, how many lookahead symbols are required to determine the end of
> any given regular expression?

I think I see what you mean.. my attempt is below

> Is there an algorithm for computing it?
> Or, is there a counter-example which requires unbounded lookahead
> beyond the end of a regular expression to determine when the regular
> expression is matched (and is not part of the overlap with some other
> regular expression)?

It seems like a counterexample is:

E[0] = ab+
E[1] = bc+d
E[2] = bbc+e

input: abbbcccccccc<arbitrarily many c symbols>cccd

The algorithm wouldn't know whether to match ab or abb until it found
the letter after the last c symbol, and we cant put a limit on the #
of c symbols. Was that the right interpretation of your question?


> Hope this helps,
> -Chris

Yes, it was very helpful. Thank you for the detailed response. Do you
think most compiler textbooks would give me the background needed to
solve this problem? Are there any particular books or references that
you think would be good, so I don't have to spend time re-inventing
theory that is not very unique to this problem?


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.