Related articles |
---|
Why is flex pattern-matching of NULs slow? costello@mitre.org (Roger L Costello) (2022-04-08) |
Why is flex pattern-matching of NULs slow? christopher.f.clark@compiler-resources.com (Christopher F Clark) (2022-04-09) |
From: | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
Newsgroups: | comp.compilers |
Date: | Sat, 9 Apr 2022 21:40:45 +0300 |
Organization: | Compilers Central |
References: | 22-04-001 |
Keywords: | lex, i18n, comment |
Posted-Date: | 09 Apr 2022 16:19:41 EDT |
I haven't looked at Flex in a while either, but what I remember is
that 0 is used as the end-of-buffer and EOF indication and that you had to
validate against that. I don't recall whether that required an
attempt at reading or not. It wouldn't surprise me if it were used as
a flag also, and for a "null pointer". Depending upon how you look at
it, C either hates 0 or loves it, but it is very often "special".
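To make the "validate against that" point concrete, here is a minimal sketch of a sentinel-terminated scan loop. It is not Flex's generated code; the `SentinelBuf` type and the `scan_word` function are hypothetical names invented for this illustration. The assumption is that the buffer holds `len` input bytes followed by one extra 0 byte, so a 0 seen anywhere before index `len` is real data and needs a position check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer type: `data` holds `len` bytes of input followed by
   one extra 0 byte that acts as the end-of-buffer sentinel. */
typedef struct {
    const char *data;
    size_t len;          /* number of real input bytes, sentinel excluded */
} SentinelBuf;

/* Count the bytes from `start` up to the next space or the end of input.
   The common path tests only the byte value. The position check runs
   only when a 0 byte is seen, to tell a NUL in the data from the sentinel. */
size_t scan_word(const SentinelBuf *b, size_t start)
{
    assert(start <= b->len);
    size_t i = start;
    for (;;) {
        char c = b->data[i];
        if (c == ' ')
            break;
        if (c == 0) {
            if (i == b->len)     /* sentinel: genuine end of buffer */
                break;
            /* otherwise a NUL inside the data: extra work on this path */
        }
        i++;
    }
    return i - start;
}
```

With text that contains no NULs, the inner position check is reached exactly once, at the end of the buffer. With NUL-heavy input it is reached on every NUL, which is one plausible way a scanner built around a 0 sentinel ends up slower on such data.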
But if you are parsing human-readable ASCII text, having 0 (NUL) be an
EOF mark is actually not a bad solution. If I recall correctly, that
isn't even a bad choice for human-readable UTF-8 (including
non-Latin-1 texts, because multi-byte sequences never contain NUL
bytes). It only becomes a pain if you want to parse binary data.
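The UTF-8 claim can be checked mechanically. The sketch below encodes a code point by the standard UTF-8 bit layout and reports whether any byte of the encoding is 0. The function names `utf8_encode` and `utf8_has_nul` are made up for this example, and the encoder does not reject surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[4]; returns the byte count.
   Surrogates are not rejected in this sketch. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Returns 1 if any byte of the UTF-8 encoding of cp is 0, else 0. */
int utf8_has_nul(uint32_t cp)
{
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    for (size_t i = 0; i < n; i++)
        if (buf[i] == 0)
            return 1;
    return 0;
}
```

Every lead byte of a multi-byte sequence has its top two bits set and every continuation byte has its top bit set, so the only code point whose encoding contains a 0 byte is U+0000 itself.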
By the way, in our lexer, we used -1, i.e. what getc used to return
for EOF, for the same condition, and I don't recall how we put it in the
buffer (or whether we even did). Being ex-PL/I and Pascal
programmers, we used strings with lengths in many places instead of C
strings. I don't remember whether we used Paul Abrahams's clever hack
of putting the length at the end of the string, which, if done right, also
serves as a null byte for use as C strings.
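For contrast with the 0 sentinel, here is a rough sketch of the counted-string approach with an out-of-band -1 for end of input. This is not the lexer described above, whose details are not given; `CountedInput`, `next_byte`, and `LEX_EOF` are hypothetical names. The point is only that when the length is stored separately and the reader returns an `int`, a NUL byte is ordinary data.

```c
#include <assert.h>
#include <stddef.h>

#define LEX_EOF (-1)   /* out-of-band end marker; EOF is commonly -1 in C libraries */

/* Hypothetical counted input: the length is stored, not implied by a terminator. */
typedef struct {
    const unsigned char *data;
    size_t len;        /* total number of bytes in data */
    size_t pos;        /* index of the next byte to deliver */
} CountedInput;

/* Return the next byte as a value in 0..255, or LEX_EOF once the input
   is exhausted. The return type is int so that -1 cannot collide with
   any byte value, including 0 and 0xFF. */
int next_byte(CountedInput *in)
{
    assert(in->pos <= in->len);
    if (in->pos == in->len)
        return LEX_EOF;
    return in->data[in->pos++];
}
```

The cost is a bounds comparison on every byte, which is exactly the check a sentinel scheme tries to avoid on the common path.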
--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------
[You're right about UTF-8, where NUL is also a reasonable string terminator.
UTF-8 is self-synchronizing -- the bytes of no UTF-8 code point are a prefix
or suffix of any other code point. -John]