re: C multibyte character encoding

derek@knosof.uucp (Derek M Jones)
Tue, 12 Apr 1994 23:04:08 GMT

          From comp.compilers

Newsgroups: comp.compilers
From: derek@knosof.uucp (Derek M Jones)
Keywords: i18n, C, comment
Organization: Compilers Central
References: 94-04-044 94-04-072
Date: Tue, 12 Apr 1994 23:04:08 GMT

Irwin Naumann <irwin@Thinkage.On.CA> writes:
>My current task is to implement the wide character and multibyte character
>routines (mblen, mbstowcs, mbtowc, wcstombs, and wctomb) for our C runtime
>library.


Being ahead of the game you will of course also want to implement
Amendment 1 to the C standard (currently in PDAM {Proposed Draft Amendment}
form).


Creating an ASCII implementation is not hard. It is the other languages
that people choose to speak that make life difficult.
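
To show how little there is to the ASCII case, here is a minimal sketch
of the two single-character routines. The names ascii_mbtowc and
ascii_wctomb are made up for illustration, and the sketch assumes every
character is one 7-bit byte with no shift states; it is not taken from
any particular runtime library.

```c
#include <stddef.h>

/* Hypothetical ASCII-only stand-in for mbtowc.
   Returns 0 for the null character, 1 for any other ASCII byte, and
   -1 if the byte is not ASCII or n is 0. A null s asks whether the
   encoding has shift states; ASCII has none, so the answer is 0. */
int ascii_mbtowc(wchar_t *pwc, const char *s, size_t n)
{
    unsigned char c;

    if (s == NULL)
        return 0;               /* stateless encoding */
    if (n == 0)
        return -1;              /* nothing to examine */
    c = (unsigned char)*s;
    if (c > 0x7F)
        return -1;              /* not an ASCII byte */
    if (pwc != NULL)
        *pwc = (wchar_t)c;
    return c != 0;
}

/* Hypothetical ASCII-only stand-in for wctomb.
   Returns 1 and stores one byte, or -1 if wc is outside ASCII.
   A null s again reports "no shift states" by returning 0. */
int ascii_wctomb(char *s, wchar_t wc)
{
    if (s == NULL)
        return 0;
    if ((unsigned long)wc > 0x7F)
        return -1;              /* cannot be written as one ASCII byte */
    *s = (char)wc;
    return 1;
}
```

Everything difficult lives in the branches that return -1 here: a real
multibyte locale has to decode those bytes instead of rejecting them.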


>We have the Unicode documents that describe one method for wide character
>encoding. I cannot find any literature on multibyte character encoding.


There are two issues: 1) the external, printable representation and 2)
the internal representation (for which Unicode is a very good way to go).
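
In standard C terms the external form is the multibyte string and the
internal form is the wchar_t string, with mbstowcs and wcstombs
converting between them according to the current locale. The wrappers
below are only a sketch; the names external_to_internal and
internal_to_external are my own, and all they add is guaranteed
termination of the output buffer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Convert external (multibyte) text to internal (wide) text.
   dst_len is the capacity of dst in wide characters, terminator included.
   Returns the number of wide characters stored, not counting the
   terminator, or (size_t)-1 on an invalid sequence or zero capacity. */
size_t external_to_internal(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;
    n = mbstowcs(dst, src, dst_len - 1);   /* leave room for terminator */
    if (n == (size_t)-1)
        return n;
    dst[n] = L'\0';
    return n;
}

/* Convert internal (wide) text back to external (multibyte) text.
   dst_len is the capacity of dst in bytes, terminator included. */
size_t internal_to_external(char *dst, size_t dst_len, const wchar_t *src)
{
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;
    n = wcstombs(dst, src, dst_len - 1);   /* leave room for terminator */
    if (n == (size_t)-1)
        return n;
    dst[n] = '\0';
    return n;
}
```

A program starts in the "C" locale, so without a call to setlocale these
only handle the basic character set; what a given locale name selects is
up to the implementation.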


>I have a few questions regarding multibyte (mb) encoding. Does each
>software development company have its own multibyte character encoding ?


If they are big and Japanese the answer is Yes.


>Is a specific company's multibyte character encoding proprietary or
>covered by copyright ?


Yes. But I have not heard of anybody being sued (the Japanese don't do
that sort of thing).


>Are public domain versions of multibyte character encoding available ?


The problem is not the availability, but convincing your customers to
switch.


>[My impression is that the most common multi-byte encoding is the shifted
>character sets used in the European ISO (ASCII-like) character codes. -John]


Shift encoding is very common.
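
For anyone who has not met one, here is a sketch of why shift encodings
complicate routines like mblen: the meaning of a byte depends on state
set up by earlier bytes. The encoding below is a toy of my own, not any
vendor's scheme. It assumes the ASCII control codes SO (0x0E) and SI
(0x0F) switch between one-byte and two-byte mode, and the function name
toy_shift_count is made up.

```c
#include <stddef.h>

#define SHIFT_OUT 0x0E   /* SO: enter two-byte mode (toy assumption) */
#define SHIFT_IN  0x0F   /* SI: return to one-byte mode (toy assumption) */

/* Count the characters in a buffer written in the toy encoding.
   Shift bytes change state and are not counted as characters.
   Returns -1 if the buffer ends in the middle of a two-byte character. */
long toy_shift_count(const unsigned char *s, size_t len)
{
    int two_byte = 0;    /* current shift state; initial state is one-byte */
    long count = 0;
    size_t i = 0;

    while (i < len) {
        if (s[i] == SHIFT_OUT) {
            two_byte = 1;
            i += 1;
        } else if (s[i] == SHIFT_IN) {
            two_byte = 0;
            i += 1;
        } else if (two_byte) {
            if (i + 1 >= len)
                return -1;       /* second byte is missing */
            i += 2;
            count += 1;
        } else {
            i += 1;
            count += 1;
        }
    }
    return count;
}
```

Note that the scan has to start from a known state at the beginning of
the text; dropping into the middle of a buffer gives no way to tell
which mode applies.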


For a good discussion of the issues see "Understanding Japanese
Information Processing" by Ken Lunde (O'Reilly, ISBN 1-56592-043-0).
This book really ought to be called "Understanding Japanese Character
Processing".


At the moment C supports multi-byte characters in strings, character
constants and comments. Moves are afoot to support multi-byte
identifiers. This issue was last discussed two years ago at the WG14 (ISO
C) meeting in Tokyo. It is now coming back onto the agenda again.


There seem to be two ways of supporting multi-byte identifiers:


      1) Have a '-Multi_byte character_set' option. The compiler then
            knows what additional characters may occur and how to process them.
            For: enables user friendly implementations, Against: reduces
            code portability.


      2) Have some sort of magic character that causes the compiler to
            treat everything that follows it as being part of a multi-byte
            character.
            For: keeps code portable. Against: makes it difficult for the
            compiler to do meaningful, user-friendly things with identifiers.
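
To make option 2 concrete, here is one way a lexer might scan such
identifiers. Everything specific in it is an assumption for the sake of
illustration, not a proposal before WG14: the magic character is '@',
and it must be followed by exactly four hexadecimal digits naming one
extended character. The names MAGIC, is_escape and scan_identifier are
likewise made up.

```c
#include <ctype.h>
#include <stddef.h>

#define MAGIC '@'   /* hypothetical marker, not from any standard */

/* Nonzero if s starts with MAGIC followed by four hex digits.
   The loop stops at the first non-hex byte, so it never reads past
   the terminating null of a short string. */
static int is_escape(const char *s)
{
    int i;

    if (s[0] != MAGIC)
        return 0;
    for (i = 1; i <= 4; i++)
        if (!isxdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Return the length in bytes of the identifier at the start of s,
   or 0 if there is none. Letters, underscores and escapes may appear
   anywhere; digits may appear anywhere except first. */
size_t scan_identifier(const char *s)
{
    size_t i = 0;

    for (;;) {
        unsigned char c = (unsigned char)s[i];

        if (is_escape(s + i))
            i += 5;              /* MAGIC plus four hex digits */
        else if (c == '_' || isalpha(c) || (i > 0 && isdigit(c)))
            i += 1;
        else
            break;
    }
    return i;
}
```

The source file stays within the basic character set, which is the
portability argument. The cost shows up as soon as the compiler has to
print the name back to the user, since all it holds is the escaped form.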


Does anybody have any thoughts?


derek jones


ps. Plug for C standards work.


The following people should be able to help you obtain copies
of the C Addendum. Their addresses might be back to front for
some of you.


derek@knosof.co.uk UK, next meeting 12 May at BSI in London.
keie@cs.vu.nl Netherlands
keld@dkuug.dk Denmark
m.noda@xopen.co.uk Japan, noda@swp.bsd.mt.nec.co.jp may also work
rex@aussie.com US
[The JCLT has covered this in fair detail as well. -John]