Re: Lexing Unicode strings?

gah4 <gah4@u.washington.edu>
Tue, 4 May 2021 14:47:08 -0700 (PDT)

          From comp.compilers


From: gah4 <gah4@u.washington.edu>
Newsgroups: comp.compilers
Date: Tue, 4 May 2021 14:47:08 -0700 (PDT)
Organization: Compilers Central
References: 21-05-001 21-05-002
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="94587"; mail-complaints-to="abuse@iecc.com"
Keywords: i18n, comment
Posted-Date: 04 May 2021 18:55:59 EDT
In-Reply-To: 21-05-002

On Tuesday, May 4, 2021 at 8:33:50 AM UTC-7, gah4 wrote:


(snip, I wrote)


> Note that in addition to having a 16 bit Unicode char, the Java language
> itself is defined in terms of Unicode. Variable names can be any Unicode
> letter, followed by Unicode letters and digits. Presumably, then, the
> designers of Java compilers have figured this out, I suspect using the 16 bit char.


(snip)


> Yes, Unicode can be fun!
> [Remember that Unicode is a 21 bit code and for characters outside the first 64K,
> Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8
> seem clean and beautiful. -John]


I did know that Java used 16 bits, but never tried to figure out what they did
with the rest of the characters. There should be enough in the first 64K for writing
programs.


I did once use π for a variable name, with the obvious value. It seems it is \u03c0.
I even found an editor that allowed entering such characters, and then would write
out the file with \u escapes. As far as I know, that is more usual in Java source than raw UTF-8.
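
For what it's worth, here is a minimal sketch of that π identifier, written with
the \u03c0 escape so the source file stays pure ASCII. The class name PiDemo and
the helper circleArea are made up for illustration.

```java
public class PiDemo {
    // \u03c0 is GREEK SMALL LETTER PI; after escape translation the
    // compiler sees an ordinary identifier named with that letter.
    static final double \u03c0 = Math.PI;

    // Hypothetical helper, only here to show the identifier in use.
    static double circleArea(double r) {
        return \u03c0 * r * r;
    }

    public static void main(String[] args) {
        // Is the character a letter, and may it start an identifier?
        System.out.println(Character.isLetter('\u03c0'));
        System.out.println(Character.isJavaIdentifierStart('\u03c0'));
        System.out.println(circleArea(1.0));
    }
}
```

An editor that shows the escape as π and writes it back as \u03c0 would produce
exactly this kind of file.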


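I believe that the Java lexer translates \u escapes very early, before
tokenizing, such that you can even delimit strings with \u0022. The
catch is that \u0022 inside a string is translated the same way and ends
the literal, as does \uu0022 (any number of u's is allowed), so an
embedded quote still needs \".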
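
A small sketch of how early the escapes are translated, as I read section 3.3 of
the Java Language Specification. The class and field names are hypothetical. Note
that \uu0022 is treated exactly like \u0022, so neither form can sit inside a
literal as an embedded quote; that still takes a backslash escape.

```java
public class EscapeDemo {
    // The delimiters are written as escapes; the lexer sees a normal
    // two-character string literal.
    static final String A = \u0022hi\u0022;

    // Extra u's are permitted, so this is the same literal again.
    static final String B = \uu0022hi\uu0022;

    // An embedded quote needs the usual backslash escape.
    static final String C = "say \"hi\"";

    public static void main(String[] args) {
        System.out.println(A.equals("hi"));
        System.out.println(B.equals(A));
        System.out.println(C);
    }
}
```

The multiple-u form seems to exist so that a tool converting a Unicode source file
to ASCII can add one u to the escapes already present and later undo the
conversion without losing track of which escapes were original.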


[If you're only going to allow the lower 64K, your users will be sad
when they try to use quoted strings with uncommon Chinese characters
or with emoji, or more likely your compiler will barf since they will
be encoded as two surrogate characters and your lexer won't know what
to do with them. If you're going to deal with Unicode, better bite the
bullet and deal with the whole mess. -John]
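
To make the surrogate issue concrete, here is a rough sketch of scanning by code
point instead of by 16 bit char. It uses only the standard String and Character
methods; the class CodePointScan and its two helpers are hypothetical, not taken
from any real compiler.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointScan {
    // Hypothetical helper: walk the string one code point at a time, so a
    // surrogate pair is seen as a single character.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp); // 1 for the first 64K, 2 for a pair
        }
        return out;
    }

    // Hypothetical identifier check that classifies code points, not chars.
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600, an emoji
        System.out.println(s.length());           // counts 16 bit chars
        System.out.println(codePoints(s).size()); // counts code points
    }
}
```

A lexer that classifies each 16 bit char separately would see the two halves of
the pair as unassigned-looking surrogates rather than as one letter or symbol.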

