String formats (Stavros Macrakis)
6 Nov 91 20:08:09 GMT

          From comp.compilers

Related articles
Different string format options, benefits? (Sean C. Cox) (1991-10-16)
Re: Different string format options, benefits? (1991-11-05)
String formats (1991-11-06)
Re: String formats (1991-11-08)
Re: String formats (1991-11-11)
| List of all articles for this month |

Newsgroups: comp.compilers
From: (Stavros Macrakis)
Keywords: code, optimize
Organization: OSF Research Institute
References: 91-10-061 91-11-015
Date: 6 Nov 91 20:08:09 GMT

In article 91-10-061 (Sean C. Cox) writes:
      >I am curious about the benefits/costs related to the two general character
      >string formats [null terminated vs. counted]

There are actually many more useful and general string representations
than the two you cite. It's useful to think about what operations you
plan to perform on the strings:

-- Static creation vs. dynamic creation
-- Read-only vs. modifiable
-- Nature of modifications: insert/delete? append? prepend?
-- Many common substrings?
-- Character-by-character operations?
-- Serial or random access to characters?
-- The nature of the string base itself.
-- Search or lookup within or among strings?

You also need to think about the machine you're running on:

-- Indexing capabilities
-- Page locality important?
-- Sub-byte operations fast?
-- Byte operations fast?
-- Library available for some existing representation?

To choose some extreme examples:

If you're doing static creation of read-only strings with serial access
only in a very small memory, you may want to store strings in some sort of
compressed form (e.g. Huffman), as long as you have reasonable shift
operations etc. as needed for sub-byte manipulation.

If you're doing a lot of insert/delete with random access to dynamically
created strings, you may want gapped strings (a la Gnu Emacs buffers).

If you're doing dictionary lookup, you may want a trie representation
(since you can share prefixes).

If you do a lot of string splicing, doubly-linked lists may be best.

etc. etc.

Null terminated vs. counted are far from the only reasonable
representations. Reasonable programming languages let you define your own
representations while allowing higher-level operations to be oblivious to
that representation. This allows you to change representation based on
tuning data and architectural considerations on each installation.

Building a string representation into a program, much less a language, is
poor software engineering practice. Indeed, you often need more than one
in a program. For instance, Gnu Emacs uses both counted strings and
gapped strings (for buffers) -- null-terminated strings are only used at
initialization (where they are converted to counted strings) and to
interface to Unix libraries.

Languages do need to make some sort of choice of string representation for
string literals (although even there you could imagine mechanisms for user
control), but in general they should allow the programmer to use the
string representation appropriate for his/her application.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.