Re: lexical analysis question

Chris F Clark <cfc@world.std.com>
30 Mar 2003 21:26:23 -0500

          From comp.compilers

Related articles
lexical analysis question dreder1ck@hotmail.com (2003-03-30)
Re: lexical analysis question matkin@acm.org (Mats Kindahl) (2003-03-30)
Re: lexical analysis question cfc@world.std.com (Chris F Clark) (2003-03-30)
Re: lexical analysis question dreder1ck@hotmail.com (2003-04-05)
| List of all articles for this month |

From: Chris F Clark <cfc@world.std.com>
Newsgroups: comp.compilers,comp.theory
Date: 30 Mar 2003 21:26:23 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 03-03-178
Keywords: lex
Posted-Date: 30 Mar 2003 21:26:23 EST

Your problem as posed is not completely precise. It is possible you
are trying to solve one of several different problems. (Text of
original posting cited at end for those who wish to read the problem
as written for the benefit of those readers in comp.theory who may not
have seen the original posting.)


I am hoping this is not a homework problem. It is just contrived
enough to be a homework problem in an automata class and represent a
special class of machines that I'm not aware of. However, it is also
possible you are trying to solve an "interesting" problem.


Let us review your problem:


You have a set of regular expressions, E[i].
You have a stream of symbols, s.
You have a constant bound on the lookahead, C.


You also state that your regular expressions are not ambiguous. That
clearly means that for two expressions E[i] and E[j], they do not
match exactly the same string.


Does it also mean that no string matching E[i] is a prefix of a string
matching E[j] (and vice versa)? Is it possible for a prefix of string
matching E[j] to match a suffix of a string matching E[i]?


Next, can your stream of symbols contain errors, sequences of symbols
that match no regular expression or is it error-free?


----------------------------------------------------------------------


If the no prefix/no suffix property holds and the input can contain
errors, then the algorithm of LEX (construct a DFA to recognize the
string) is minimal in the sense that it is linear, will never
backtrack, and will recognize each E[i] upon processing the last
character which the E[i] matches.


If your input
1) has potential errors (sequences of symbols that may not
                      match any regular expression)
2) and you do not have "overlap ambiguities" (see the next
section) in your regular expressions that make finding the
end of a regular expression problematic,
then you cannot do asymptotically better than a DFA (look at each
symbol once). The reason being that you must look at each symbol at
least once to validate that it is not an error. In addition, the DFA
will look at each symbol exactly once to determine if it is the end of
a regular expression.


If this were a homework problem, I would expect you to prove that with
more rigor than the argument given above. However, since you claimed
to already know about DFA's, I think stating that they are the
solution to this interpretation of your problem, is not a particularly
big hint. You pretty much claimed this already in your posting.


----------------------------------------------------------------------


If the regular expressions, while unambiguous, have prefixes that can
match suffixes of other strings (an "overlapping ambiguity"), your
constant bound of C becomes relevant in that one cannot distinuguish
"overlap ambiguities" that are longer than C.


Note, the "overlap ambiguity" problem is a hard one. The LEX solution
of take the longest match of the left-most regular expression is a
heuristic that mostly works, but fails for certains sets of regular
expressions (ones where to match a later regular expression, the
left-most token must not be matched to the longest case, but to some
shorter sequence). Next is an example language, that cannot be
succussfully processed if one uses the longest-match rule:


E[0]: a b+;
E[1]: b c;


s = a b b c


correct results: E[0]->a b, E[1]->b c
longest match results: E[0]->a b b, error no match for "c"


In this case, one has to lookahead one symbol past the end of E[0] to
see if the next symbol is a "c", requiring lookahead 2. Adding more
b's to the beginning of E[1] increases this lookahead requirement.
Note, that in any case, the problem can still be solved with a DFA.


You can even find sets of regular expressions where 3 or more of the
rules interact to determine the end of the first regular expression.


E[0]: a b+;
E[1]: b c;
E[2]: c c d;


s = a b b c c d => E[0]->a b b, E[2]->c c d
s = a b b c c c d => E[0]->a b, E[1]->b c, E[2]->c c d


In this case, you don't know when to match E[0] until you have seen
not only a "c", but as many as 3 characters following its end to see
if the sequence is "c c d".


I believe Deborah Weber-Wulfe (sp?) (working at a University in
Berlin) did work on the overlap ambiguity problem while trying to
write a provably correct lexer generator.


You will also find some interesting insights into the overlap
ambiguity problem by learning about Perl's non-greedy regular
expression operators.


In case this is a homework problem, can you show that for any given
set E, how many lookahead symbols are required to determine the end of
any given regular expression? Is there an algorithm for computing it?
Or, is there a counter-example which requires unbounded lookahead
beyond the end of a regular expression to determine when the regular
expression is matched (and is not part of the overlap with some other
regular expression)?


----------------------------------------------------------------------


If the regular expressions have prefix overlaps, the C also becomes
relevant in that "ambiguous prefixes" longer than C cannot be
disambiguated.


E[0]: a+;
E[1]: a b;


This language requires lookahead of 2 symbols (the 1st "a" and the 2nd
"a" or "b") to disambiguate E[0] and E[1].


----------------------------------------------------------------------


Finally, if your input is error free, you can potentially do better
than looking at every symbol. The Boyer-Moore string search
algorithms are a good place to look. They skip n-symbols at at time.
You can adapt the algorithm of Boyer-Moore for finding the end of a
regular expression. You can also adapt it to dismbiguating ambiguous
prefixes. For example, in the previous example, you never need to
look at the first character, you always just lookahead to the 2nd.
However, such algorithms are still always linear, as regular
expressions only describe inputs that have linear lengths, so you'll
never find a logarithmic matcher (for instance).


As far as I know, no one has ever studied this particular avenue
extensively. The reason being, is that one must assume error free
input (to be able to skip over symbols). Error free inputs are not
guaranteed in most lexing/parsing applications--they may make sense
for automata explorations, as you often aren't dealing with human
generated input, but something that represents a controlled language
that is known to have certain properties.


One intersting case of this has been studied by either Nakata or Sassa
(I cannot remember which--they co-authored one of the seminal papers
on ELR parsing) based on a special lexical operator available in
Yacc++. The Yacc++ operator can be used to guarantee that all inputs
are error free (while not ambiguous), and thus the special cases
apply.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
------------------------------------------------------------------------------


> Hi, I'm trying to figure out a lexical analysis problem. I have a few
> questions about it. (I think I asked on the wrong compilers newsgorup
> before).
>
> The problem: Develop an algorithm for lexical analysis (similar to
> what lex does), where the input is a set of regular expressions, a
> constant C, and a stream of symbols. The output is some string
> signifying which regular expressions were matched, in what order.
> However, this algorithm can only use a constant lookahead C, which in
> this particular case means that if s[1...infinity] is the input
> stream, you are not allowed to look at symbol s[i+C] or any later
> symbol until you have already matched symbol s[i] to some regular
> expression. Come up with the fastest algorithm possible for this.
>
> (obviously this algorithm would not be able to handle arbitrary
> regular expressions, due to the fixed lookahead. It only needs to work
> for sets of regular expressions where ambiguity cannot arise)
>
> My questions:
>
> (1) Has this problem been studied extensively?
>
> (2) Is the use if 'fixed lookahead' as I describe it above common? Or
> does 'fixed lookahead' usually have some other meaning?
>
> (3) If I want to get some ideas for this problem, what resources do
> you suggest that I look at? I have taken a class in finite automata,
> so I'm somewhat familiar with regular expressions and DFAs. I think I
> understand very abstractly algorithm behind lex -- setting up the NFA
> based on the regular expressions, converting this to a DFA, and then
> trying to minimize this DFA. But, this doesn't really utilize the
> constant C..
>
> I am stuck on this problem -- does anyone have advice about what
> avenues to persue?
>
> -Drederick


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.