
Related articles
  lexical analysis question dreder1ck@hotmail.com (2003-03-30)
  Re: lexical analysis question matkin@acm.org (Mats Kindahl) (2003-03-30)
  Re: lexical analysis question cfc@world.std.com (Chris F Clark) (2003-03-30)
  Re: lexical analysis question dreder1ck@hotmail.com (2003-04-05)

From: Chris F Clark <cfc@world.std.com>
Newsgroups: comp.compilers,comp.theory
Date: 30 Mar 2003 21:26:23 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 03-03-178
Keywords: lex
Posted-Date: 30 Mar 2003 21:26:23 EST

Your problem as posed is not completely precise. It is possible you
are trying to solve one of several different problems. (The text of
the original posting is cited at the end, for the benefit of readers
in comp.theory who may not have seen it.)

I am hoping this is not a homework problem. It is just contrived
enough to be a homework problem in an automata class and to represent
a special class of machines that I'm not aware of. However, it is also
possible you are trying to solve an "interesting" problem.

Let us review your problem:

You have a set of regular expressions, E[i].

You have a stream of symbols, s.

You have a constant bound on the lookahead, C.

You also state that your regular expressions are not ambiguous. That
clearly means that no two expressions E[i] and E[j] match exactly the
same string.

Does it also mean that no string matching E[i] is a prefix of a string
matching E[j] (and vice versa)? Is it possible for a prefix of a
string matching E[j] to match a suffix of a string matching E[i]?

Next, can your stream of symbols contain errors (sequences of symbols
that match no regular expression), or is it error-free?

----------------------------------------------------------------------

If the no prefix/no suffix property holds and the input can contain

errors, then the algorithm of LEX (construct a DFA to recognize the

string) is minimal in the sense that it is linear, will never

backtrack, and will recognize each E[i] upon processing the last

character which the E[i] matches.

If your input

      1) has potential errors (sequences of symbols that may not
      match any regular expression), and

      2) your regular expressions do not have "overlap ambiguities"
      (see the next section) that make finding the end of a regular
      expression problematic,

then you cannot do asymptotically better than a DFA (looking at each
symbol once). The reason is that you must look at each symbol at
least once to validate that it is not an error. In addition, the DFA
will look at each symbol exactly once to determine whether it is the
end of a regular expression.

If this were a homework problem, I would expect you to prove that with
more rigor than the argument given above. However, since you claimed
to already know about DFAs, I think stating that they are the solution
to this interpretation of your problem is not a particularly big hint.
You pretty much claimed this already in your posting.
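To make the DFA argument concrete, here is a minimal table-driven
scanner sketch in Python (my illustration, not from the posting). The
token set E[0] = "ab", E[1] = "cd" is a made-up example satisfying the
no-prefix/no-suffix property, so each token is recognized on its final
character and every symbol is read exactly once:

```python
# Transition table for a hand-built DFA: state -> {symbol: next_state}.
# Hypothetical token set: E[0] matches "ab", E[1] matches "cd".
DELTA = {
    0: {"a": 1, "c": 3},
    1: {"b": 2},        # state 2 accepts E[0]
    3: {"d": 4},        # state 4 accepts E[1]
}
ACCEPT = {2: "E[0]", 4: "E[1]"}

def scan(stream):
    tokens, state = [], 0
    for sym in stream:
        if sym not in DELTA.get(state, {}):
            raise ValueError(f"lexical error at symbol {sym!r}")
        state = DELTA[state][sym]
        if state in ACCEPT:          # token ends here; emit and restart
            tokens.append(ACCEPT[state])
            state = 0
    if state != 0:
        raise ValueError("unexpected end of input")
    return tokens

print(scan("abcdab"))  # -> ['E[0]', 'E[1]', 'E[0]']
```

Each input symbol drives exactly one table lookup, so the scan is
linear with no backtracking, which is the point of the argument above.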

----------------------------------------------------------------------

If the regular expressions, while unambiguous, have prefixes that can
match suffixes of other strings (an "overlapping ambiguity"), your
constant bound C becomes relevant in that one cannot distinguish
"overlap ambiguities" that are longer than C.

Note, the "overlap ambiguity" problem is a hard one. The LEX solution
of taking the longest match of the left-most regular expression is a
heuristic that mostly works, but fails for certain sets of regular
expressions (ones where, to match a later regular expression, the
left-most token must not be matched to the longest case, but to some
shorter sequence). Here is an example language that cannot be
successfully processed if one uses the longest-match rule:

E[0]: a b+;

E[1]: b c;

s = a b b c

correct results: E[0]->a b, E[1]->b c

longest match results: E[0]->a b b, error no match for "c"

In this case, one has to look ahead one symbol past the end of E[0] to
see if the next symbol is a "c", requiring lookahead 2. Adding more
b's to the beginning of E[1] increases this lookahead requirement.
Note that, in any case, the problem can still be solved with a DFA.
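The failure can be sketched in Python (my illustration; the re
patterns stand in for E[0] and E[1], and the naive maximal-munch loop
stands in for the LEX heuristic):

```python
import re

# Hypothetical rule set from the example: E[0] = a b+, E[1] = b c.
RULES = [("E[0]", re.compile(r"ab+")), ("E[1]", re.compile(r"bc"))]

def longest_match(s):
    """Naive maximal munch: at each position take the longest rule match."""
    tokens, i = [], 0
    while i < len(s):
        best = None
        for name, rx in RULES:
            m = rx.match(s, i)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            return tokens + [("error", s[i:])]
        tokens.append((best[0], s[i:best[1]]))
        i = best[1]
    return tokens

print(longest_match("abbc"))
# -> [('E[0]', 'abb'), ('error', 'c')]  -- greedy E[0] eats the second b
```

The correct tokenization, E[0]->"ab" then E[1]->"bc", is never found
because the heuristic commits to "abb" before looking at the "c".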

You can even find sets of regular expressions where 3 or more of the

rules interact to determine the end of the first regular expression.

E[0]: a b+;

E[1]: b c;

E[2]: c c d;

s = a b b c c d => E[0]->a b b, E[2]->c c d

s = a b b c c c d => E[0]->a b, E[1]->b c, E[2]->c c d

In this case, you don't know where E[0] ends until you have seen not
only a "c", but as many as 3 characters following its end, to see
whether the sequence is "c c d".
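A small tokenizer checking both tokenizations above can be sketched as
follows (my illustration; it uses backtracking over candidate match
lengths for clarity, whereas the text's point is that a DFA with
bounded lookahead can resolve the same choice without re-reading):

```python
import re

# Hypothetical rule set from the example:
# E[0] = a b+, E[1] = b c, E[2] = c c d.
RULES = [("E[0]", re.compile(r"ab+")),
         ("E[1]", re.compile(r"bc")),
         ("E[2]", re.compile(r"ccd"))]

def matches(rx, s, i):
    # All end positions j such that rx matches exactly s[i:j].
    return [j for j in range(i + 1, len(s) + 1) if rx.fullmatch(s, i, j)]

def tokenize(s, i=0):
    """Return a token list covering s[i:], or None if none exists."""
    if i == len(s):
        return []
    for name, rx in RULES:
        for j in matches(rx, s, i):   # shortest candidates first
            rest = tokenize(s, j)
            if rest is not None:
                return [(name, s[i:j])] + rest
    return None

print(tokenize("abbccd"))   # -> [('E[0]', 'abb'), ('E[2]', 'ccd')]
print(tokenize("abbcccd"))  # -> [('E[0]', 'ab'), ('E[1]', 'bc'), ('E[2]', 'ccd')]
```

The two inputs differ only in one extra "c", yet E[0] must end at a
different place in each, which is exactly the 3-symbol lookahead
interaction described above.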

I believe Debora Weber-Wulff (sp?) (working at a university in
Berlin) did work on the overlap ambiguity problem while trying to
write a provably correct lexer generator.

You will also find some interesting insights into the overlap

ambiguity problem by learning about Perl's non-greedy regular

expression operators.

In case this is a homework problem: can you show, for any given set E,
how many lookahead symbols are required to determine the end of any
given regular expression? Is there an algorithm for computing it?

Or, is there a counter-example which requires unbounded lookahead

beyond the end of a regular expression to determine when the regular

expression is matched (and is not part of the overlap with some other

regular expression)?

----------------------------------------------------------------------

If the regular expressions have prefix overlaps, C also becomes
relevant in that "ambiguous prefixes" longer than C cannot be
disambiguated.

E[0]: a+;

E[1]: a b;

This language requires lookahead of 2 symbols (the 1st "a" and the 2nd

"a" or "b") to disambiguate E[0] and E[1].
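A sketch of this two-symbol lookahead in Python (my illustration of
the example, not a general algorithm): after reading an "a" the
scanner peeks one symbol ahead; a following "b" commits the "a" to
E[1], anything else keeps it in E[0].

```python
# Hypothetical rule set from the example: E[0] = a+, E[1] = a b.
def classify(s):
    tokens, i = [], 0
    while i < len(s):
        if s[i] == "a" and i + 1 < len(s) and s[i + 1] == "b":
            tokens.append(("E[1]", "ab")); i += 2
        elif s[i] == "a":
            j = i
            while j < len(s) and s[j] == "a":
                # stop the a-run early if the next "a" starts an "ab"
                if j + 1 < len(s) and s[j + 1] == "b":
                    break
                j += 1
            tokens.append(("E[0]", s[i:j])); i = j
        else:
            raise ValueError(f"no rule matches at position {i}")
    return tokens

print(classify("aab"))  # -> [('E[0]', 'a'), ('E[1]', 'ab')]
print(classify("aaa"))  # -> [('E[0]', 'aaa')]
```

The decision window never exceeds 2 symbols, matching the lookahead
bound stated above.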

----------------------------------------------------------------------

Finally, if your input is error free, you can potentially do better
than looking at every symbol. The Boyer-Moore string search
algorithms are a good place to look. They skip n symbols at a time.

You can adapt the algorithm of Boyer-Moore to find the end of a
regular expression. You can also adapt it to disambiguate ambiguous
prefixes. For example, in the previous example, you never need to
look at the first character; you can always just look ahead to the
2nd.

However, such algorithms are still always linear, as regular

expressions only describe inputs that have linear lengths, so you'll

never find a logarithmic matcher (for instance).
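For reference, here is a minimal Boyer-Moore-Horspool search (a
simplified variant of Boyer-Moore; my illustration), showing the
symbol-skipping idea: on a mismatch the pattern is shifted by a
precomputed amount, so many input symbols are never examined at all.

```python
def horspool(pattern, text):
    """Return the index of the first occurrence of pattern, or -1."""
    m = len(pattern)
    # Default shift is the pattern length; symbols occurring inside the
    # pattern get a smaller shift based on their rightmost position.
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    i = m - 1
    while i < len(text):
        k = 0
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1          # match found
        i += shift.get(text[i], m)    # skip ahead; skipped symbols unread
    return -1

print(horspool("abd", "abcabdab"))  # -> 3
```

In the run above, the "c" mismatch lets the search jump 3 positions at
once, which is exactly the kind of skipping that only works when the
skipped symbols are guaranteed not to contain errors.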

As far as I know, no one has ever studied this particular avenue
extensively. The reason is that one must assume error-free input (to
be able to skip over symbols). Error-free inputs are not guaranteed
in most lexing/parsing applications; they may make sense for automata
explorations, where you often aren't dealing with human-generated
input, but with something that represents a controlled language known
to have certain properties.

One interesting case of this has been studied by either Nakata or
Sassa (I cannot remember which; they co-authored one of the seminal
papers on ELR parsing), based on a special lexical operator available
in Yacc++. The Yacc++ operator can be used to guarantee that all
inputs are error free (while not ambiguous), and thus the special
cases apply.

Hope this helps,

-Chris

*****************************************************************************

Chris Clark Internet : compres@world.std.com

Compiler Resources, Inc. Web Site : http://world.std.com/~compres

3 Proctor Street voice : (508) 435-5016

Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

------------------------------------------------------------------------------

> Hi, I'm trying to figure out a lexical analysis problem. I have a few
> questions about it. (I think I asked on the wrong compilers newsgroup
> before).
>
> The problem: Develop an algorithm for lexical analysis (similar to
> what lex does), where the input is a set of regular expressions, a
> constant C, and a stream of symbols. The output is some string
> signifying which regular expressions were matched, in what order.
> However, this algorithm can only use a constant lookahead C, which in
> this particular case means that if s[1...infinity] is the input
> stream, you are not allowed to look at symbol s[i+C] or any later
> symbol until you have already matched symbol s[i] to some regular
> expression. Come up with the fastest algorithm possible for this.
>
> (Obviously this algorithm would not be able to handle arbitrary
> regular expressions, due to the fixed lookahead. It only needs to work
> for sets of regular expressions where ambiguity cannot arise.)
>
> My questions:
>
> (1) Has this problem been studied extensively?
>
> (2) Is the use of 'fixed lookahead' as I describe it above common? Or
> does 'fixed lookahead' usually have some other meaning?
>
> (3) If I want to get some ideas for this problem, what resources do
> you suggest that I look at? I have taken a class in finite automata,
> so I'm somewhat familiar with regular expressions and DFAs. I think I
> understand very abstractly the algorithm behind lex -- setting up the
> NFA based on the regular expressions, converting this to a DFA, and
> then trying to minimize this DFA. But, this doesn't really utilize the
> constant C..
>
> I am stuck on this problem -- does anyone have advice about what
> avenues to pursue?
>
> -Drederick
