Re: Pascal vs C style string ? (Joseph H Allen)
Tue, 28 Jun 1994 08:04:15 GMT

          From comp.compilers

Related articles
[5 earlier articles]
Re: Pascal vs C style string ? (1994-06-27)
Re: Pascal vs C style string ? (1994-06-27)
Re: Pascal vs C style string ? (1994-06-28)
Re: Pascal vs C style string ? (Stefan Monnier) (1994-06-28)
Re: Pascal vs C style string ? (Erkki Ruohtula) (1994-06-28)
Re: Pascal vs C style string ? (1994-06-28)
Re: Pascal vs C style string ? (1994-06-28)
Re: Pascal vs C style string ? (1994-06-28)
Re: Pascal vs C style string ? (1994-06-28)
Searching (was Re: Pascal vs C style string ?) (1994-06-29)
Re: Pascal vs C style string ? (1994-06-29)
Re: Pascal vs C style string ? (1994-06-29)
Re: Pascal vs C style string ? (1994-06-30)
[4 later articles]
| List of all articles for this month |

Newsgroups: comp.compilers
From: (Joseph H Allen)
Keywords: C, Pascal, design
Organization: The World Public Access UNIX, Brookline, MA
References: 94-06-175 94-06-213
Date: Tue, 28 Jun 1994 08:04:15 GMT (Hans Boehm) writes:
>You might also want to look at the "cord" string package that's included
>with the garbage collector in It has
>the following characteristics:

This is a fascinating subject and one in which I've spent some time
researching. I took a look at your cord package. It shares many of the
ideas I implemented in two libraries of my own: the buffer management
routines in my editor JOE and a software virtual memory / file-buffering

My edit buffer management routines store a string as a doubly linked list
of fixed-length gap-buffers which can be swapped out to a file (thus
allowing you to edit files of any length). The advantage of this method
is that small localized string inserts and deletes are extremely fast.
Random indexing is slow, but this is not an operation which occurs often
in an editor as long as you have position buffers (as in your cord
library). However, there's a public domain editor out there called "The
Pointing Editor" which also uses cords (but its author calls them "pieces"

My file buffering package makes an arbitrarily large file look like a big
array and allows you to 'getc' just as quickly backwards as forwards. It
uses a hash table of recently used pages similar to that of an operating
systems file buffering system. As a software virtual memory system it's
better than others since you don't need special page-handle structures:
you just use the pages address as the handle (which you need anyway).

Your cord package is kind of a compromise between these two: it's
reasonably good at both inserting and indexing, but not horribly bad at
either (as both of my libraries are).

I especially like that you store the search path in the position buffers-
I used that trick in a b-tree library and in a skiplist implementation for
a malloc() library.

Here are some improvements you might make to your cord package:

1) Improve localized indexed access: Have CORD_fetch remember the latest
position for each string. This way you could make CORD_fetch into a macro
which only has to call a function when you go outside of the current node.
CORD_fetch would be nearly as fast as [ ] (C's array access operator).

2) Use a better algorithm for the substring search function. The
Moyer-Fig (I'm getting the name wrong) search algorithm can be adapted to
systems with position buffers. The algorithm goes like this:

      int table[256]
      for(x=0;x!=len-1;++x) table[string[x]]=x
      fwrd(pos,len) x=len amnt=size(pos)-offset(pos)
              if(x<=table[c]) fwrd(pos,len-x+1), --amnt
              else fwrd(pos,len-table[c]), amnt-=x-table[c]
              if(amnt<0) return NOTFOUND
              else x=len
      return FOUND

fwrd() moves a position forward. rgetc() moves a position backward 1 and
returns the character there. This algorithm is better than a linear
search since it doesn't have look at every character. You can modify it
to search backwards as well.

3) Store other attributes besides length along with each node. For
example, if each node knows how many line-feeds are in it, you could
search for lines in log2 time as well.

Perhaps you could provide hooks for allowing users to manage their own

4) Use skip-lists for the data structure instead of a binary tree: I think
it could work- you can search by accumulating forward-element sizes just as
well as you can by accumulating left sub-tree sizes. The advantage of
skiplists is that they're self-balancing.
/* ( */ /* Joseph H. Allen */

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.