Re: Lexing Unicode strings?

"Johann 'Myrkraverk' Oskarsson" <>
Mon, 3 May 2021 19:58:20 -0400 (EDT)

From comp.compilers

Related articles
Lexing Unicode strings? (Johann 'Myrkraverk' Oskarsson) (2021-04-21)
Re: Lexing Unicode strings? (Johann 'Myrkraverk' Oskarsson) (2021-05-03)
Re: Lexing Unicode strings? (gah4) (2021-05-04)
Re: Lexing Unicode strings? (Christopher F Clark) (2021-05-04)
Re: Lexing Unicode strings? (gah4) (2021-05-04)

From: "Johann 'Myrkraverk' Oskarsson" <>
Newsgroups: comp.compilers
Date: Mon, 3 May 2021 19:58:20 -0400 (EDT)
Organization: Compilers Central
Keywords: lex, i18n
Posted-Date: 03 May 2021 19:58:20 EDT

On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:

> Dear c.compilers,

> For context, I have been reading the old book Compiler Design in C
> by Allen Holub; available here

> and it goes into the details of the author's own LeX implementation.

> [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
> a sequence of bytes. UTF-8 was designed so that no character representation is a
> prefix or suffix of any other character, so it should work without having to be clever
> -John]

That's not always feasible, nor the right approach. Consider the
range of all lowercase Greek letters. In the source file, that range
will look something like [\xCE\xB1-\xCF\x89], and clearly the intent is
not to match the bytes \xCE, \xB1..\xCF, and \x89.
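To make the intent concrete: the range is meant to cover the code
points U+03B1 (alpha) through U+03C9 (omega), which the lexer only sees
after decoding. A minimal sketch in C (my own hypothetical helpers, not
Holub's code) of decoding the two-byte sequences and testing the range:

```c
#include <stdint.h>

/* Decode a 2-byte UTF-8 sequence -- enough for the Greek range above.
   Returns the code point, or -1 if the bytes are not a valid,
   shortest-form 2-byte sequence.  (Hypothetical helper, not Holub's.) */
static int32_t decode2(const unsigned char *p)
{
    if ((p[0] & 0xE0) != 0xC0 || (p[1] & 0xC0) != 0x80)
        return -1;                       /* wrong lead/continuation bits */
    int32_t cp = ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    return cp < 0x80 ? -1 : cp;          /* reject overlong forms */
}

/* Match [\xCE\xB1-\xCF\x89] as intended: U+03B1 .. U+03C9. */
static int is_greek_lower(const unsigned char *p)
{
    int32_t cp = decode2(p);
    return cp >= 0x03B1 && cp <= 0x03C9;
}
```

Read as three separate byte alternatives, the same class would match a
lone \xCE or \x89, which never denotes a lowercase Greek letter.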

There is also the question of validating the input. It seems more
natural to put the overlong-sequence and legal-code-point validation
into the lexer than to preprocess the source file.

But maybe that's "premature optimization"?
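For illustration, a standalone validator along those lines might look
like this sketch (my code, with no counterpart in Holub's book); it
rejects stray continuation bytes, truncated input, overlong forms,
UTF-16 surrogates, and code points beyond U+10FFFF:

```c
#include <stddef.h>
#include <stdint.h>

/* Validate one UTF-8 sequence starting at p, with n bytes available.
   Returns the sequence length (1..4), or 0 on any error: invalid lead
   byte, truncation, bad continuation, overlong form, surrogate, or a
   code point above U+10FFFF.  (A sketch, not Holub's code.) */
static int utf8_check(const unsigned char *p, size_t n)
{
    static const uint32_t minimum[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    int len, i;
    uint32_t cp;

    if (n == 0) return 0;
    if (p[0] < 0x80) return 1;                           /* ASCII */
    if      ((p[0] & 0xE0) == 0xC0) { len = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; cp = p[0] & 0x07; }
    else return 0;                   /* stray continuation or bad lead */
    if (n < (size_t)len) return 0;                       /* truncated */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < minimum[len]) return 0;                     /* overlong */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;          /* surrogate */
    if (cp > 0x10FFFF) return 0;
    return len;
}
```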

UTF-8 has more problems, too. There are bytes that can only appear
after other bytes, and detecting that early would be nice[0]; there is
also the fact that some glyphs can be constructed with either a single
code point or two.

Also, Unicode input is not always UTF-8.

When Holub constructs his state machines for the regular
expressions, he uses bitmasks for the ranges[1]. ASCII, at 128
bits per mask, is reasonable; even 256 bits for 8-bit characters.

0x110000 bits per mask, one for every Unicode code point, feels
excessive to me, though.
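One common alternative, assumed here rather than taken from Holub, is
to store a character class as sorted, disjoint code-point intervals and
binary-search them; the lowercase Greek class above then costs one
entry instead of a 0x110000-bit mask:

```c
#include <stdint.h>

/* A character class as sorted, disjoint [lo,hi] code-point intervals.
   (A sketch; Holub's sets really are bitmasks.) */
struct range { uint32_t lo, hi; };

/* Binary-search the n intervals in r for code point cp. */
static int in_class(const struct range *r, int n, uint32_t cp)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if      (cp < r[mid].lo) hi = mid - 1;
        else if (cp > r[mid].hi) lo = mid + 1;
        else return 1;
    }
    return 0;
}

/* The lowercase Greek range above is a single interval: */
static const struct range greek_lower[] = { { 0x03B1, 0x03C9 } };
```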

I was maybe naive when I thought Unicode was considered a
"solved problem." I guess everyone is using either libraries like
ICU or run-time environments. These can have the following
"problem": invalid input may not be signalled to the higher-level
code, but is quietly replaced with the replacement character, U+FFFD.
That is, in my opinion, not good for a compiler.

[0] This can be part of the generated lexer tables, and is harder to
do when bootstrapping the lexer by hand.

[1] His implementation of sets.
[I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding
is a prefix or suffix of any other UTF-8 encoding, you can lex them
the same way you'd lex strings of ASCII. In that example above, \xCE,
\xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
a multi-byte sequence, so if they do, you can put a wildcard . at the
end to match bogus bytes and complain about an invalid character. Dunno
what you mean about not always UTF-8; I realize there are mislabeled
files of UTF-16 that you have to sort out by sniffing the BOM at the
front, but you do that and turn whatever you're getting into UTF-8
and then feed it to the lexer.

I agree that lexing Unicode is not a solved problem, and I'm not
aware of any really good ways to limit the table sizes. -John]
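For what it's worth, the BOM sniff described above can be a few lines;
this sketch (names are mine) only distinguishes the common cases and
returns NULL when no recognizable BOM is present:

```c
#include <stddef.h>

/* Peek at the first bytes of a file and report the encoding its byte
   order mark claims, or NULL if there is no recognizable BOM.
   (Hypothetical helper; UTF-32 BOMs are omitted for brevity.) */
static const char *sniff_bom(const unsigned char *p, size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return "UTF-8";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return "UTF-16LE";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return "UTF-16BE";
    return NULL;
}
```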
