From: Hans Aberg <haberg-news@telia.com>
Newsgroups: comp.compilers
Date: Wed, 14 Jul 2021 15:39:25 -0400 (EDT)
Organization: A noiseless patient Spider
References: 21-05-001
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="96033"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, i18n
Posted-Date: 14 Jul 2021 15:39:25 EDT
In-Reply-To: 21-05-001
Content-Language: en-US
On 2021-05-04 01:58, John Levine wrote:
> [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding
> is a prefix or suffix of any other UTF-8 encoding, you can lex them
> the same way you'd lex strings of ASCII. In that example above, \xCE,
> \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
> a multi-byte sequence, so if they do, you can put a wildcard . at the
> end to match bogus bytes and complain about an invalid character. Dunno
> what you mean about not always UTF-8; I realize there are mislabeled
> files of UTF-16 that you have to sort out by sniffing the BOM at the
> front, but you do that and turn whatever you're getting into UTF-8
> and then feed it to the lexer.
>
> I agree that lexing Unicode is not a solved problem, and I'm not
> aware of any really good ways to limit the table sizes. -John]
I wrote code, in Haskell and C++, that translates Unicode character
classes into byte classes. From a theoretical standpoint, the image of a
Unicode regular language under the UTF-8 encoding is a regular language
over bytes, so this is always possible. The 2^8 = 256 entry tables that
Flex uses are therefore enough. The Flex manual has an example of how to
write a regular expression that replaces the dot '.' and matches all
legal UTF-8 byte sequences.
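
To make the translation concrete, here is a minimal C++ sketch of one common way to do it. It is not the code mentioned above. It uses the well-known range-splitting approach: cut a code point range wherever the encoded length changes, then wherever the trailing 6-bit groups are not fully covered, so that each piece becomes a plain sequence of byte ranges. The names `ByteRange`, `ByteSeq`, `encode_utf8`, `split` and `utf8_byte_ranges` are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One inclusive byte range; a ByteSeq is a concatenation of such ranges
// and matches every UTF-8 encoding of one "shape". (Hypothetical names.)
using ByteRange = std::pair<std::uint8_t, std::uint8_t>;
using ByteSeq   = std::vector<ByteRange>;

// Encode one scalar value (assumed <= 0x10FFFF and not a surrogate).
static std::vector<std::uint8_t> encode_utf8(std::uint32_t cp) {
    auto b = [](std::uint32_t v) { return static_cast<std::uint8_t>(v); };
    if (cp < 0x80)    return { b(cp) };
    if (cp < 0x800)   return { b(0xC0 | (cp >> 6)), b(0x80 | (cp & 0x3F)) };
    if (cp < 0x10000) return { b(0xE0 | (cp >> 12)),
                               b(0x80 | ((cp >> 6) & 0x3F)),
                               b(0x80 | (cp & 0x3F)) };
    return { b(0xF0 | (cp >> 18)), b(0x80 | ((cp >> 12) & 0x3F)),
             b(0x80 | ((cp >> 6) & 0x3F)), b(0x80 | (cp & 0x3F)) };
}

static void split(std::uint32_t lo, std::uint32_t hi, std::vector<ByteSeq>& out) {
    if (lo > hi) return;
    // Surrogates U+D800..U+DFFF have no UTF-8 encoding: cut them out.
    if (lo <= 0xDFFF && hi >= 0xD800) {
        if (lo < 0xD800) split(lo, 0xD7FF, out);
        if (hi > 0xDFFF) split(0xE000, hi, out);
        return;
    }
    // Split where the encoded length changes (1, 2, 3, 4 bytes).
    for (std::uint32_t max : {0x7Fu, 0x7FFu, 0xFFFFu}) {
        if (lo <= max && max < hi) {
            split(lo, max, out);
            split(max + 1, hi, out);
            return;
        }
    }
    // Pure ASCII: a single one-byte range.
    if (hi < 0x80) {
        out.push_back({ { static_cast<std::uint8_t>(lo),
                          static_cast<std::uint8_t>(hi) } });
        return;
    }
    // Split until, for each group of trailing 6-bit blocks, lo and hi either
    // agree above it or cover it completely (all zeros .. all ones).
    for (int i = 1; i <= 3; ++i) {
        const std::uint32_t m = (1u << (6 * i)) - 1;
        if ((lo & ~m) != (hi & ~m)) {
            if ((lo & m) != 0) {
                split(lo, lo | m, out);
                split((lo | m) + 1, hi, out);
                return;
            }
            if ((hi & m) != m) {
                split(lo, (hi & ~m) - 1, out);
                split(hi & ~m, hi, out);
                return;
            }
        }
    }
    // Now the encodings of lo and hi bound each byte position independently.
    const auto a = encode_utf8(lo);
    const auto z = encode_utf8(hi);
    ByteSeq seq;
    for (std::size_t k = 0; k < a.size(); ++k) seq.push_back({ a[k], z[k] });
    out.push_back(seq);
}

// Translate the code point range [lo, hi] into alternatives of byte ranges.
std::vector<ByteSeq> utf8_byte_ranges(std::uint32_t lo, std::uint32_t hi) {
    std::vector<ByteSeq> out;
    if (hi > 0x10FFFF) hi = 0x10FFFF;  // clamp to the Unicode code space
    split(lo, hi, out);
    return out;
}
```

For the Greek range U+03B1..U+03C9, `utf8_byte_ranges(0x3B1, 0x3C9)` yields two alternatives, which in Flex notation would read `\xCE[\xB1-\xBF]|\xCF[\x80-\x89]`. For the whole code space, `utf8_byte_ranges(0, 0x10FFFF)` yields nine alternatives, the same shapes as the table of well-formed UTF-8 byte sequences in the Unicode Standard, which is what a byte-level replacement for '.' has to match. A full character class would apply this to each of its ranges and take the union.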