Re: looking for a better way (help with algorithm design)

"Michael Roach" <>
25 Sep 2001 00:18:20 -0400

          From comp.compilers

Related articles
looking for a better way (help with algorithm design) (Michael Roach) (2001-09-25)
Re: looking for a better way (help with algorithm design) (Michael Roach) (2001-09-25)
Re: looking for a better way (help with algorithm design) (Georg Bisseling) (2001-09-29)
| List of all articles for this month |

From: "Michael Roach" <>
Newsgroups: comp.compilers
Date: 25 Sep 2001 00:18:20 -0400
Organization: Compilers Central
References: 01-09-107
Keywords: design
Posted-Date: 25 Sep 2001 00:18:20 EDT

----- Original Message -----
From: "Raoul Gough" <>
Sent: Sunday, September 23, 2001 8:57 PM

| How about this:
| map<char, map <int, set<StringID> > >
| This stores a mapping of characters to [mapping of locations to sets
| of string IDs]. This would tell, for instance, that character "A"
| occurs at position 1 in strings 3 and 4, and at position 2 in string
| 2. You could then very quickly find the intersection of sets for
| each of the characters in your abbrieviation, respecting
| location. Actually, maybe you could get even faster with
| map<int, map <char, set <StringID> > >
| Space complexity looks like (ummm, guessing wildly) O(nm
| log(n)log(m)) where n is the number of strings and m is the average
| string length.

This is actually pretty close to the scheme that I have been
considering since posting my original message. Here's what I have been
tinkering with; forgive me if this is a bit confusing, hopefully it
will become clear with examples below.

Given a set of keys S, create a mapping between each character in the
alphabet of S and those characters that follow it in one or more keys
within S. In this sense, I maintain the information regarding the
left->right ordering of characters.

Here's how I think it would work:

S = {1=shack, 2=snack, 3=snapple} you would wind up with something akin:

    a -> c e k l p = 1,2,3
    c -> k = 2
    e -> nul = 3
    h -> a c k = 1
    k -> nul = 1,2
    l -> e = 3
    n -> a c e k l p = 2,3
    p -> e l = 3
    s -> a c e h k l n p = 1,2,3

I call the vector indexed by the entire alphabet of S the FIRST set. The
FOLLOW set consists of all the characters that can legally follow FIRST(x).
The RESULT set then contains the indexes of the keys within S that a given
FIRST/FOLLOW pairing are valid for.

Using the following algorithm one should then be able to determine whether
its a valid unique abbreviation and its corresponding key, if conflicts
exist (ie., not enough info in the abbreviation) what they are, or rather
quickly that its invalid.

//proposed algorithm//

    FIRST, FOLLOW, and RESULT sets as described above.
    A = the abbreviation in question.
    i = index into A; initialized to start of A.

      IF NOT FIRST( A[i] )
            Print( "Invalid abbreviation %s. Invalid character %c at %d", A, A[i],
i );

      IF N > 1
            // more than one possible key matched.
            IF ++i > LENGTH( A )
                  // more information is needed to distinguish
                  // between possible matching keys.
                  Print( "Abbreviation %s is not unique. Closest possible matches
are\n", A );
                  PrintResultSet( FOLLOW( FIRST( A[i-1] )));
                  // can our new character actually follow the previous one?
                  IF A[i] IS IN FOLLOW( FIRST( A[i-1] ))
                        // yes. advance to the next RESULT set.
                        GOTO LOOP
                        // no.
                        Print( "Invalid abbreviation %s. Invalid character %c at %d", A,
A[i], i );
      ELSE IF N == 1
          IF i < LENGTH( A )
                // we came across a single match but we
                // still have more information in the
                // abbreviation.
                Print( "Invalid abbreviation %s - too many characters.", A );
                // we found a unique match
                PrintResultSet( FOLLOW( FIRST( A[i] )));
      ELSE Print( "Invalid abbreviation %s - no matches found.", A );

In words it amounts to verifying that the current character of the
abbreviation is in FIRST. If so, then if more than one RESULT entry is
found, checking to see if the next character is in the current FOLLOW set.
If it is, then repeat the process using the next character.

This seems to work out on paper using the key set {shack, snack, snapple}
and abbreviations of {nac, nc, sap, sn, nh} correctly identifying
{shack,snack,snapple} for the first 3 respectively and conflicts of
{snack,snapple} for sn and marking nh as invalid.

But it chokes on 'nak' where the letter k has two entries in its RESULT set
and an empty FOLLOW set. This might be overcome by remembering all previous
valid RESULT sets. In otherwords, I would have known that:
    n = {2,3}
    a = {1,2,3}
    k = {1,2}
then by taking the intersection could have gotten the correct key 2 (snack).
Whether this idea would scale to larger sets of keys, I have my doubts;
although I'm sure it would be required as the key set grew larger and the
FIRST/FOLLOW sets became saturated.

Another problem with this is the handling of consecutive letters such as
'pp' in snapple. If I add a 'p' to FOLLOW(FIRST(p)) then I allow invalid
abbreviations such as 'ppp' to match. But if I don't add it, a legitimate
'pp' would be invalidated. I don't have any ideas for overcoming that
problem yet.

Once the FIRST, FOLLOW, and RESULT sets have been created, and the previous
two issues I've identified above are solved (and no others identified),
searching should be extremely fast and space requirements would be rather
minimal for even large sets of keys (I think).

Something along these lines is what's been bouncing around in my head, but
this current algorithm smells more and more fishy as it scales to larger and
larger keysets.



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.